PC#1 Exercises Introduction To NCBI 2020-Solved

The document provides an introduction to the NCBI and its databases, including how to explore the Entrez system and perform searches related to cancer and human sequences. It details the number of records found in various databases for the terms 'cancer' and 'human', as well as the significance of using specific search expressions. Additionally, it covers the use of Batch Entrez for retrieving sequences related to Homo sapiens and includes instructions for downloading data using command line tools.

Uploaded by

marti.diez

Practical Session #1: Introduction to NCBI and Entrez databases.

I. Explore the National Center for Biotechnology Information (NCBI) website and get familiar
with its design, its environment, and its major databases.

https://www.ncbi.nlm.nih.gov/

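What is NCBI? How many databases are hosted by the NCBI?

NCBI is the National Center for Biotechnology Information, part of the U.S. National Library of
Medicine at the NIH. It hosted 59 databases at the time of this exercise:
https://www.ncbi.nlm.nih.gov/guide/all/#databases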

Which of the following NCBI databases could be considered primary databases? Protein,
Nucleotide, CDD, PubMed, Gene, Genomes, RefSeq, BioProject.

Protein, Nucleotide, PubMed, BioProject. PubMed is a special case: it does not hold
“experimental data” as such, but its records are stored “as is”, with no post-processing.

II. The Entrez system

Perform a Global Query at the NCBI through the Entrez using the expression (all[Filter]). Which
database contains the largest number of records?

https://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers

https://www.ncbi.nlm.nih.gov/genbank/statistics/

https://www.ncbi.nlm.nih.gov/search/all/?term=all[Filter]

Now we are interested in finding all the information at the NCBI related to the group of human
diseases known as cancer. Type the word “cancer” in the search box on the NCBI homepage
and run the search (Global Query). Note that the query is interpreted differently in different
databases.

How many scientific papers contain this word?
4180400 (https://www.ncbi.nlm.nih.gov/pubmed/?term=cancer)

How many nucleotide sequences?
10438437 (https://www.ncbi.nlm.nih.gov/nuccore/?term=cancer)

How many cancer-related functional genomics studies have been stored at NCBI?
28765 (https://www.ncbi.nlm.nih.gov/bioproject/?term=cancer)
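
The counts above were read from the web interface. As a rough sketch, the same numbers can be requested from the command line through the E-utilities ESearch endpoint with rettype=count. The helper name esearch_count_url is made up for this example, and it only encodes the characters used in this worksheet's queries (space and square brackets).

```bash
#!/usr/bin/env bash
# esearch_count_url is a hypothetical helper name, not an NCBI tool.
# It prints an ESearch URL that asks only for the number of matching records.
esearch_count_url() {
    local db="$1" term="$2" encoded
    # Percent-encode the only special characters used in this worksheet's
    # queries: space, '[' and ']'.
    encoded=$(printf '%s' "$term" | sed -e 's/ /%20/g' -e 's/\[/%5B/g' -e 's/]/%5D/g')
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=%s&term=%s&rettype=count\n' \
        "$db" "$encoded"
}

# Print one URL per database queried in this section.
for db in pubmed nuccore bioproject; do
    esearch_count_url "$db" "cancer"
done
```

Each URL can be fetched with, for example, curl -s "$(esearch_count_url pubmed cancer)"; the XML reply carries the number in a Count element. Today's counts will be higher than the 2020 figures listed above.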

Why does the Taxonomy database give us one record? (For discussion in class)
https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=6754&lvl=3&lin=f&keep=1&srchmode=1&unlock
“Cancer” is also the name of a crustacean taxon in the family Cancridae, so the word matches an
organism name rather than the disease.

Perform a new Global Query but using the word “human”.

How many entries (records) have been obtained for the different databases?
https://www.ncbi.nlm.nih.gov/search/all/?term=human

Would we get the same results if we perform a Global Query using the search expression
(homo sapiens),
https://www.ncbi.nlm.nih.gov/search/all/?term=homo%20sapiens

(human[organism]),
https://www.ncbi.nlm.nih.gov/search/all/?term=human%5Borganism%5D

or (homo sapiens[organism])?
https://www.ncbi.nlm.nih.gov/search/all/?term=homo+sapiens%5Borganism%5D

Why?
It appears that in some databases “human” is a clear alias of “homo sapiens”, but not in all of
them (e.g. the PubChem databases).

When [organism] is included, the term is searched only in the “organism” field of the
database. Some databases use controlled vocabularies for this field; others do not.

Beware that the counts on the main results screen do not necessarily match the counts in the
database itself (when we click the link), e.g. Taxonomy.

How is the expression [organism] interpreted by each database?

It depends on whether the database recognizes the field tag.

If you are interested in studying human cancer, which of the following strategies would
produce a more useful set of results in a Global Query at the NCBI?

cancer AND human

cancer[organism] AND human

cancer AND human[organism]

cancer OR human[organism]

“cancer AND human” searches for both terms in any field, so it also returns cases like
https://www.ncbi.nlm.nih.gov/gene/39645575, a gene from a “Klebsiella pneumoniae” isolated
from a cancer patient. Restricting the organism with “cancer AND human[organism]” avoids
hits of this kind.

III. At NCBI each record is assigned a UID (“unique integer identifier”) for internal tracking. In
sequence databases, records are usually referred to by their accession number instead.

Which NCBI database do the following identifiers belong to?

CM000253.1: GenBank Nucleotide

NG_011877.1: RefSeq Nucleotide

SRX4644664: SRA

NP_002266.2: RefSeq Protein

CP027442.1: GenBank Nucleotide (take care with the genome entry)

PRJNA490405: BioProject

CAB37359.1: GenBank Protein

ADE87724.1: GenBank Protein
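
The answers above follow from the shape of each identifier: RefSeq accessions carry a two-letter prefix plus an underscore, SRA and BioProject identifiers have their own fixed prefixes, and GenBank accessions are letters followed by digits (two letters and six digits for the nucleotide entries here, three letters and five digits for the protein entries). A minimal sketch of that reasoning, with classify_accession as a made-up helper that covers only the identifier styles seen in this exercise:

```bash
#!/usr/bin/env bash
# classify_accession is a hypothetical helper; the patterns cover only the
# identifier styles that appear in this exercise, not every NCBI format.
classify_accession() {
    local acc="${1%%.*}"   # drop the version suffix, e.g. ".1"
    case "$acc" in
        PRJ*)                 echo "BioProject" ;;
        SRX*)                 echo "SRA" ;;
        [NXYW]P_*)            echo "RefSeq protein" ;;
        N[CGMRTW]_*|X[MR]_*)  echo "RefSeq nucleotide" ;;
        [A-Z][A-Z][A-Z][0-9][0-9][0-9][0-9][0-9])
                              echo "GenBank protein" ;;
        [A-Z][A-Z][0-9][0-9][0-9][0-9][0-9][0-9])
                              echo "GenBank nucleotide" ;;
        *)                    echo "unknown" ;;
    esac
}

# Classify the eight identifiers from the exercise.
for id in CM000253.1 NG_011877.1 SRX4644664 NP_002266.2 \
          CP027442.1 PRJNA490405 CAB37359.1 ADE87724.1; do
    printf '%s\t%s\n' "$id" "$(classify_accession "$id")"
done
```

The pattern order matters: the fixed prefixes are tested before the generic letter and digit patterns. Anything outside these patterns is reported as unknown rather than guessed.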

IV. Open the NCBI entry with accession number NG_011877 and get familiar with the format
and the different fields used to store sequence information. This will open in the GenBank
Flat File Format.

What does this entry represent? Do you think this entry provides cross-references (links) to
other databases? From which organism was this sequence obtained? What is the UID or
identifier for this organism in the Taxonomy database? What does the underscore “_” in the
accession number stand for? Display the entry in FASTA format. What happened?

V. In the Taxonomy database explore all the information related to the organism Homo
sapiens.

Look at the lineage for this taxon. What order do humans belong to? Primates. What is the txid
for this mammalian order? 9443

How many human protein sequences are there today at the NCBI? 1421783

VI. Advanced searches

With which of these strategies will you find all the human sequences stored in the nucleotide
database at the NCBI?

A. txid9606[Primary organism]

B. homo sapiens[Primary Organism]

C. homo sapiens[porgn]
D. human[porgn]

They are all synonyms. In this case the controlled vocabulary works.

VII. Batch Entrez

Use Batch Entrez to upload a file of GIs or accession numbers from the Nucleotide or Protein
databases, or upload a list of record identifiers from other Entrez databases. Batch Entrez will
automatically download the corresponding records.

In this exercise we will retrieve from the NCBI database all sequences related to Homo sapiens
tumor protein 53 (TP53) published in a paper with PubMed Central accession number PMC3675194.
This flat text file has a list of the accession numbers referenced in this paper.

1. Save the text file locally on your computer.
2. Open Batch Entrez.

https://www.ncbi.nlm.nih.gov/sites/batchentrez

3. Select the database from which the list of accessions will be queried.
4. Use the “Browse” button to select the file containing the list of identifiers from
your system directory.
5. After pressing the “Retrieve” button you will see a list of record summaries. Retrieve
them!
6. Optionally, select a format in which to display the data for viewing and/or saving.
Select “Send to file” to save the file.

How many records are on the list? 79

Which database do the entries belong to? Nucleotide

Do all entries represent human sequences? Yes

grep "Homo sapiens" nuccore_result.txt | less -NS

Do all entries represent mRNA sequences? Yes

grep " bp " nuccore_result.txt | grep "mRNA" | less

Do all sequences belong to the same human subject? No

Do all sequences have the same length? No

grep "^LOCUS" nuccore_result.txt | sort -k3,3n | less
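
The grep checks above rely on reading the output by eye. As a sketch, assuming nuccore_result.txt was saved in GenBank flat-file format (where each record starts with a LOCUS line holding the accession, the length and the molecule type), the same questions can be answered by counting distinct values. The helper names locus_summary and count_distinct are invented for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for a file saved from Nucleotide in GenBank
# flat-file format (one LOCUS line per record).

# Print "accession length molecule_type" for every record in file $1.
locus_summary() {
    awk '$1 == "LOCUS" { print $2, $3, $5 }' "$1"
}

# Print how many distinct values column $2 of the summary takes
# (2 = sequence length, 3 = molecule type).
count_distinct() {
    locus_summary "$1" |
        awk -v col="$2" '{ seen[$col] = 1 } END { n = 0; for (k in seen) n++; print n }'
}

# Usage, once nuccore_result.txt exists in the current directory:
#   locus_summary nuccore_result.txt | wc -l    # number of records
#   count_distinct nuccore_result.txt 2         # more than 1: lengths differ
#   count_distinct nuccore_result.txt 3         # exactly 1: a single molecule type
```

Matching on the first field being LOCUS avoids false hits from other lines that happen to contain " bp ".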

If Batch Entrez does not work, you can obtain the same results with this link:
https://www.ncbi.nlm.nih.gov/nuccore/?term=KC820708:KC820786[pacc]
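
The range KC820708:KC820786 in the link covers consecutive accessions, so the identifier list can also be regenerated locally if the text file is missing. The sketch below assumes accessions made of a letter prefix and a six-digit number; accession_range is a made-up helper name.

```bash
#!/usr/bin/env bash
# accession_range is a hypothetical helper. It assumes accessions made of a
# letter prefix followed by a six-digit number, as in KC820708, and numeric
# bounds written without leading zeros.
accession_range() {
    local prefix="$1" first="$2" last="$3" n
    n="$first"
    while [ "$n" -le "$last" ]; do
        printf '%s%06d\n' "$prefix" "$n"
        n=$((n + 1))
    done
}

# Count the accessions in the range used by the link above.
accession_range KC 820708 820786 | wc -l
```

The range spans 79 accessions, which agrees with the number of records reported for the Batch Entrez list.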
You can also download all these sequences as FASTA files using E-utilities under Linux / OSX /
Cygwin, etc.:

1) Download the PMC3675194-List_IDs.txt file
2) Go to the download directory in a command line shell (e.g. bash)
3) Execute the following 2 commands:
a. dos2unix PMC3675194-List_IDs.txt
b. cat PMC3675194-List_IDs.txt | xargs -tI% wget -O %.fasta "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=%&rettype=fasta&retmode=text"

Note that command “b” is a single line: it starts with “cat” and ends with the closing quote
after “text”. The quotes around the URL keep the shell from interpreting the “&” characters.
Command “a” is important; you should know why.
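
One likely reason command “a” matters: a list saved on Windows ends every line with a carriage return plus a newline, and without conversion the carriage return becomes part of each identifier and of each output file name. The sketch below strips carriage returns itself and, as an alternative to one request per accession, joins the identifiers with commas into a single EFetch URL. The helper name efetch_fasta_url is invented for this example.

```bash
#!/usr/bin/env bash
# efetch_fasta_url is a hypothetical helper. It reads one accession per line
# from file $1, removes carriage returns and blank lines, and prints a single
# EFetch URL that requests all records as FASTA text.
efetch_fasta_url() {
    local ids
    ids=$(tr -d '\r' < "$1" | grep -v '^[[:space:]]*$' | paste -sd, -)
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=%s&rettype=fasta&retmode=text\n' "$ids"
}

# Usage, once the list file exists in the current directory:
#   wget -O all_sequences.fasta "$(efetch_fasta_url PMC3675194-List_IDs.txt)"
```

This variant produces one multi-record FASTA file instead of one file per accession, so it suits cases where the sequences will be processed together.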
