Uploaded by

cpmarqui
Copyright
© Attribution Non-Commercial (BY-NC)
Search PubMed with R: Part 1 and Part 2

R Project

R is a free software environment for statistical computing, data manipulation, calculation and graphical display (1,2). For those interested, the associated Bioconductor project provides many additional R packages for statistical data analysis in different life science areas, such as tools for microarray, next-generation sequencing and genome analysis. The R software is free and runs on all common operating systems (2-4). It also facilitates the inclusion of biological metadata from literature sources such as PubMed, and it provides access to powerful statistical and graphical methods.
References:
1- The R Project for Statistical Computing: http://www.r-project.org/
2- W. N. Venables, D. M. Smith and the R Development Core Team. An Introduction to R. Notes on R: A Programming Environment for Data Analysis and Graphics. Version 2.14.2 (2012-02-29).
3- R & Bioconductor Manual. Author: Thomas Girke, UC Riverside. http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual#TOC-R-Basics
4- Bioconductor: http://www.bioconductor.org/

Install R
1- Install the latest release of R according to the instructions provided by The R Project for Statistical Computing: http://www.r-project.org/
2- Once installed, open the R command window (R console).
3- In the R console, the red > prompt is where you type commands.
4- Any text in R that begins with the hash symbol # is treated as a comment and ignored.

References
1- The R Project for Statistical Computing: http://www.r-project.org/
2- Bioconductor: http://www.bioconductor.org/
3- R Tutorials. W.B. King. 2010. http://ww2.coastal.edu/kingw/statistics/R-tutorials/preliminaries.html

Install packages in R
1- In the R console, type the following to connect to Bioconductor:
source("http://bioconductor.org/biocLite.R")
2- To request installation of the default Bioconductor packages, type:
biocLite()
3- Install the packages "RISmed" and "tm" by typing (see next slide):
biocLite(c("RISmed", "tm"))
4- Install the package "ggplot2" by typing:
biocLite("ggplot2")
Package RISmed downloads content from NCBI databases.
Package tm provides text mining functionality.
Package ggplot2 is for data visualization.
References
1- Bioconductor: http://www.bioconductor.org/
RISmed package: Stephanie Kovalchik (2013). RISmed: Download content from NCBI databases. R package version 2.1.0. http://CRAN.R-project.org/package=RISmed
tm package: Ingo Feinerer and Kurt Hornik (2013). tm: Text Mining Package. R package version 0.5-8.3. http://CRAN.R-project.org/package=tm
ggplot2 package: H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009. http://had.co.nz/ggplot2/book also http://cran.r-project.org/web/packages/ggplot2/index.html
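The CRAN URLs in the references above suggest that all three packages can also be installed directly from CRAN, which may help if the biocLite.R script is not available for your R version. The sketch below only reports which packages are missing; the helper name missing_packages is hypothetical, and the actual install call is left commented out because it needs internet access.

```r
# Packages used in this tutorial (names taken from the slides above).
needed <- c("RISmed", "tm", "ggplot2")

# Hypothetical helper: return the subset of `pkgs` that is not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)
to_install  # empty when everything is already installed

# Uncomment to install the missing packages from CRAN (needs internet access):
# if (length(to_install) > 0) install.packages(to_install)
```

Checking first avoids reinstalling packages that are already present.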

The R Console

Query PubMed titles for oncolytic virus using RISmed

Type the following in the R console:
library(RISmed)
onc <- EUtilsSummary("oncolytic virus[Majr]")
onc
# [1] "\"oncolytic viruses\"[MeSH Major Topic]"
fetch.onc <- EUtilsGet(onc)
fetch.onc
# PubMed query: "oncolytic viruses"[MeSH Major Topic] Records: 713
onc.tit <- ArticleTitle(fetch.onc)
onc.tit <- unlist(onc.tit)
# export title results as text file
write(onc.tit, file="title_oncolytic_virus.txt")

Query PubMed MeSH topics for oncolytic virus using RISmed

# Continue to type the following in the R console:
mh <- Mesh(fetch.onc)
# for each article, bind its MeSH records into a data frame and join the headings with ";"
mh.per.row <- lapply(seq_along(mh), function(i){
mh.df.rbind <- as.data.frame(do.call(rbind, mh[i]))
paste(mh.df.rbind$Heading, collapse= ";")
})
mh.list <- unlist(mh.per.row)
# The following exports the mesh results as a text file
write(mh.list, file="mesh_oncolytic_virus.txt")
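To make the collapsing step easier to follow, here is a minimal base-R sketch on mock data. It assumes, as the code above implies, that each list element is a data frame with a Heading column, and it assumes that articles without MeSH terms are represented by NA. The objects mh.mock and collapse_headings are hypothetical and not part of RISmed.

```r
# Hypothetical stand-in for Mesh(fetch.onc): one element per article,
# either a data frame with a Heading column or NA (assumed) when no MeSH terms exist.
mh.mock <- list(
  data.frame(Heading = c("Oncolytic Viruses", "Neoplasms"),
             Type = c("Descriptor", "Descriptor")),
  NA,
  data.frame(Heading = "Oncolytic Virotherapy", Type = "Descriptor")
)

# Collapse the headings of one article into a single ";"-separated string.
collapse_headings <- function(x) {
  if (is.data.frame(x)) paste(x$Heading, collapse = ";") else ""
}

mh.list.mock <- vapply(mh.mock, collapse_headings, character(1))
mh.list.mock
```

The result has one string per article, so it lines up row by row with the vector of titles.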

View results in Excel

# export both title and mesh results as one text file to view as a table in Excel
tit.mh <- cbind(onc.tit, mh.list)
tit.mh[1:10,] # view first 10 results
write.table(tit.mh, file="tit_mesh_oncolytic_virus.txt", row.names=FALSE, sep="\t")
# now open the file in Excel

Column containing titles

Column containing corresponding Mesh terms
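Before opening the file in Excel, the export can be checked by reading it back into R. The following is a small sketch with toy values standing in for onc.tit and mh.list; it writes to a temporary file so nothing in the working directory is touched.

```r
# Toy stand-ins for onc.tit and mh.list (hypothetical values).
titles <- c("Oncolytic virus therapy of glioma", "Measles virus as a vector")
mesh   <- c("Oncolytic Viruses;Glioma", "Measles virus;Genetic Vectors")
tit.mh.toy <- cbind(titles, mesh)

# Write a tab-delimited file to a temporary location, then read it back.
out.file <- tempfile(fileext = ".txt")
write.table(tit.mh.toy, file = out.file, row.names = FALSE, sep = "\t")
check <- read.delim(out.file, stringsAsFactors = FALSE)

dim(check)    # number of rows and columns read back
names(check)  # column headers taken from the first line of the file
```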

Preparing for Text Mining Analysis

Type getwd() in the R console to display the R working directory. In my case:
[1] "C:/Documents and Settings/PMarqui/My Documents"
Now create a new folder in the R working directory and give it a name (for example, OncolyticVirus).
Place two of the recently created text files in the new folder: title_oncolytic_virus.txt and mesh_oncolytic_virus.txt
Start the Text Mining Analysis.
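The folder can also be prepared from within R instead of the file manager. This is a sketch only: prepare_corpus_dir is a hypothetical helper built on the base functions dir.create and file.copy, and the usage line assumes the two text files are in the current working directory.

```r
# Hypothetical helper: create `folder` inside `base.dir` and copy `files` into it.
prepare_corpus_dir <- function(files, folder = "OncolyticVirus", base.dir = getwd()) {
  target <- file.path(base.dir, folder)
  dir.create(target, showWarnings = FALSE)  # no warning if it already exists
  copied <- file.copy(files, target, overwrite = TRUE)
  if (!all(copied)) warning("some files could not be copied")
  target  # return the folder path
}

# Usage with the two files written earlier (uncomment in your own session):
# prepare_corpus_dir(c("title_oncolytic_virus.txt", "mesh_oncolytic_virus.txt"))
```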

Text Mining Analysis


# Type the following in the R Console
library(tm) # loads the text mining package
my.corpus <- Corpus(DirSource("OncolyticVirus"), readerControl=list(reader=readPlain))
# Note that "OncolyticVirus" refers to the name of the newly created folder. In DirSource("...") you must use the name given to the folder containing the 2 text files
my.corpus <- tm_map(my.corpus, stripWhitespace) # removes extra whitespace
my.corpus <- tm_map(my.corpus, gsub, pattern="[^[:alnum:][:space:]]", replacement=" ") # replaces punctuation with a space; as written, the pattern also replaces the dash "-"
# my.corpus <- tm_map(my.corpus, removeNumbers) # removes numbers (optional)
# With tm 0.6 or later, wrap base functions in content_transformer(), e.g. tm_map(my.corpus, content_transformer(gsub), pattern="[^[:alnum:][:space:]]", replacement=" ")
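The effect of the punctuation pattern is easy to check on a single string in base R. The sketch below compares the pattern used above with a variant that keeps the dash; the sample string and the helper squish are hypothetical.

```r
x <- "HSV-1; virus, oncolytic"

# Pattern from the slide: everything that is not alphanumeric or whitespace
# becomes a space, so the dash is replaced as well.
no.punct <- gsub("[^[:alnum:][:space:]]", " ", x)

# Variant that keeps the dash: a "-" placed last inside the brackets is literal.
keep.dash <- gsub("[^[:alnum:][:space:]-]", " ", x)

# Hypothetical helper: collapse runs of whitespace into a single space.
squish <- function(s) gsub("[[:space:]]+", " ", s)

squish(no.punct)   # "HSV 1 virus oncolytic"
squish(keep.dash)  # "HSV-1 virus oncolytic"
```

Whether to keep the dash depends on whether hyphenated names such as HSV-1 should count as one term.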

Text Mining Analysis


# Continue and type the following code in the R Console:
my.corpus <- tm_map(my.corpus, tolower) # conversion to lower case letters (with tm 0.6 or later use content_transformer(tolower))
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english")) # removes stopwords
my.corpus <- tm_map(my.corpus, stemDocument) # removes suffixes from words to get their common origin
my.corpus.matrix <- TermDocumentMatrix(my.corpus) # creates a term-document matrix
mat.my.corpus <- as.matrix(my.corpus.matrix) # creates a matrix
my.corpus.df <- as.data.frame(mat.my.corpus) # creates a data frame from the matrix, listing all the terms found in either of the 2 documents
my.corpus.df[200:250,1:2] # view some of the terms
copy.my.corpus.df <- my.corpus.df # make a copy of the my.corpus.df data frame to keep the original for later
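For readers new to the idea, a term-document matrix has one row per term, one column per document, and the term counts as cell values. The following base-R sketch builds one by hand for two toy documents; it illustrates the structure only and is not how tm computes it internally.

```r
# Two toy documents (hypothetical content), already lower-cased and cleaned.
docs <- list(
  mesh.txt  = "oncolytic viruses neoplasms oncolytic",
  title.txt = "oncolytic virus therapy"
)

# Split each document into words.
tokens <- lapply(docs, function(d) strsplit(d, "[[:space:]]+")[[1]])

# All distinct terms across both documents.
terms <- sort(unique(unlist(tokens)))

# Count every term in every document: rows are terms, columns are documents.
toy.tdm <- vapply(tokens,
                  function(tk) as.integer(table(factor(tk, levels = terms))),
                  integer(length(terms)))
rownames(toy.tdm) <- terms

toy.df <- as.data.frame(toy.tdm)
toy.df
```

As in my.corpus.df, a term that is absent from one document gets a count of 0 in that column.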

Text Mining Analysis


# Continue and type the following code in the R Console:
# sort the data frame by the most frequent mesh terms
my.corpus.df <- my.corpus.df[order(my.corpus.df$mesh_oncolytic_virus.txt, decreasing = TRUE),]
# assign the 50 most frequent mesh terms to xx
xx <- my.corpus.df[1:50,]
# view the top 5 most frequent mesh terms; "head(xx, 5)" is equivalent
xx[1:5,]
# sort the 50 most frequent mesh terms in increasing order (for plot visualization)
xx <- xx[order(xx$mesh_oncolytic_virus.txt, decreasing = FALSE),]
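The sort-then-take-the-top pattern is repeated for titles later on, so it can be wrapped in a small function. This is a sketch on toy data; top_terms and the column names mesh and title are hypothetical.

```r
# Toy term counts with terms as row names (hypothetical values).
toy.df <- data.frame(mesh = c(5, 0, 12, 3), title = c(2, 7, 9, 0),
                     row.names = c("cell", "gene", "virus", "tumor"))

# Hypothetical helper: the n rows with the highest counts in `column`.
top_terms <- function(df, column, n = 50) {
  ranked <- df[order(df[[column]], decreasing = TRUE), , drop = FALSE]
  head(ranked, n)
}

top2 <- top_terms(toy.df, "mesh", n = 2)
rownames(top2)  # "virus" "cell"
```

Using head() instead of 1:n also avoids NA rows when the data frame has fewer than n rows.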

Text Mining Analysis


# Continue and type the following code in the R Console:
# Plot the 50 most frequent mesh terms using library ggplot2
library(ggplot2)
Terms <- rownames(xx)
Mesh.count <- xx$mesh_oncolytic_virus.txt
# colour (not fill) sets the colour of the default point shape
ggplot(xx) + geom_point(aes(Terms, Mesh.count), colour = "darkblue") + coord_flip() + theme_bw()
# keep the terms in their sorted order instead of alphabetical order
p1 <- last_plot() + scale_x_discrete(limits=Terms)
p1
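An alternative way to control the order of the terms is to reorder them inside aes() with the base function reorder(), which removes the need to pre-sort the data frame and to set the axis limits. A minimal sketch on toy data, assuming ggplot2 is installed; the data frame toy and its columns are hypothetical.

```r
library(ggplot2)

# Toy term counts (hypothetical values).
toy <- data.frame(Term = c("virus", "cell", "tumor"), Count = c(12, 5, 3))

# reorder(Term, Count) sorts the terms by their count on the axis.
p.toy <- ggplot(toy, aes(x = reorder(Term, Count), y = Count)) +
  geom_point(colour = "darkblue") +
  coord_flip() +
  labs(x = "Terms", y = "Count") +
  theme_bw()
p.toy
```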

Text Mining Analysis


View the 50 most frequent mesh terms

Part 2

Text Mining Analysis


# Continue and type the following code in the R Console:
# now select the most frequent title terms; therefore sort by title count in decreasing order
my.corpus.df <- my.corpus.df[order(my.corpus.df$title_oncolytic_virus.txt, decreasing = TRUE),]
xy <- my.corpus.df[1:50,] # assign the 50 most frequent title terms to xy
xy[1:5,] # view the top 5 most frequent title terms
# sort the 50 most frequent title terms in increasing order (for plot visualization)
xy <- xy[order(xy$title_oncolytic_virus.txt, decreasing = FALSE),]
# Plot the 50 most frequent title terms
require(ggplot2)
Terms <- rownames(xy) # term names must come from xy, not xx, to match the title counts
Title.count <- xy$title_oncolytic_virus.txt
ggplot(xy) + geom_point(aes(Terms, Title.count), colour = "darkblue") + coord_flip() + theme_bw()
p2 <- last_plot() + scale_x_discrete(limits=Terms)
p2

Text Mining Analysis


View the 50 most frequent title terms

Text Mining Analysis


# Continue and type the following code in the R Console:
# Create separate data frames for each frequency type
my.corpus.sub1.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt>0 & title_oncolytic_virus.txt>0) # terms common to the 2 documents
my.corpus.sub1.df[200:300,1:2] # view some of the subset terms
my.corpus.sub2.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt==0 & title_oncolytic_virus.txt>0) # terms present in title but not in mesh
my.corpus.sub2.df[200:300,1:2] # view some of the terms (200-300)
my.corpus.sub3.df <- subset(my.corpus.df, mesh_oncolytic_virus.txt>0 & title_oncolytic_virus.txt==0) # terms present in mesh but not in title
my.corpus.sub3.df[200:300,1:2] # view some of the terms

# Correlate term counts in title and mesh
cor(my.corpus.df$title_oncolytic_virus.txt, my.corpus.df$mesh_oncolytic_virus.txt)
# correlation coefficient is [1] 0.4442518
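The three subsets and the correlation can be tried out on a small data frame where the result is easy to verify by eye. This is a sketch with hypothetical counts; the column names mesh and title stand in for the two file-name columns.

```r
# Toy term counts with terms as row names (hypothetical values).
toy.df <- data.frame(mesh = c(5, 0, 12, 3, 0), title = c(2, 7, 9, 0, 1),
                     row.names = c("cell", "gene", "virus", "tumor", "novel"))

both       <- subset(toy.df, mesh > 0 & title > 0)   # terms in both documents
title.only <- subset(toy.df, mesh == 0 & title > 0)  # terms only in titles
mesh.only  <- subset(toy.df, mesh > 0 & title == 0)  # terms only in mesh

rownames(both)        # "cell" "virus"
rownames(title.only)  # "gene" "novel"
rownames(mesh.only)   # "tumor"

# Pearson correlation between the two count columns.
cor(toy.df$title, toy.df$mesh)
```

When viewing a subset, head() is a safe alternative to a fixed range such as 200:300, because indexing past the last row of a data frame returns rows of NA.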

Text Mining Analysis


# the line below generates a term frequency vector (total count of each term over both documents)
termFrequency <- rowSums(as.matrix(my.corpus.matrix))
my.tdm <- TermDocumentMatrix(my.corpus, control = list(minWordLength = 1))
my.tdm
# A term-document matrix (2632 terms, 2 documents)
# the lines below select those terms from the term-document matrix which occur at least 100 times in each document
findFreqTerms(my.tdm[,1], lowfreq=100)
findFreqTerms(my.tdm[,2], lowfreq=100)
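The same selection can be expressed in base R on an ordinary matrix, which may make the logic of the frequency threshold more transparent. A minimal sketch on a toy matrix; frequent_in is a hypothetical helper, and treating the threshold as "greater than or equal to" is an assumption based on the "at least 100 times" wording above.

```r
# Toy term-document matrix: rows are terms, columns are documents (hypothetical counts).
toy.m <- matrix(c(150, 20, 101, 99, 300, 5), ncol = 2,
                dimnames = list(c("virus", "cell", "tumor"),
                                c("mesh.txt", "title.txt")))

# Total count of each term over both documents.
term.frequency <- rowSums(toy.m)

# Hypothetical helper: terms whose count in document `doc` is at least `lowfreq`.
frequent_in <- function(m, doc, lowfreq = 100) {
  rownames(m)[m[, doc] >= lowfreq]
}

frequent_in(toy.m, "mesh.txt")   # "virus" "tumor"
frequent_in(toy.m, "title.txt")  # "cell"
sort(term.frequency, decreasing = TRUE)
```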

For part 2

Text Mining Analysis


# Code for plot 3: most frequent title terms with the corresponding mesh terms
my.corpus.df <- my.corpus.df[order(my.corpus.df$title_oncolytic_virus.txt, decreasing = TRUE),]
xy <- my.corpus.df[1:50,] # assign the 50 most frequent title terms to xy
# sort the 50 most frequent title terms in increasing order (for plot visualization)
xy <- xy[order(xy$title_oncolytic_virus.txt, decreasing = FALSE),]

# Plot the 50 most frequent title terms and the mesh counts of those same terms
Terms <- rownames(xy) # term names must come from xy so that they match both count vectors
Title.count <- xy$title_oncolytic_virus.txt
Mesh.count <- xy$mesh_oncolytic_virus.txt
ggplot(xy, aes(Terms)) + geom_point(aes(y = Mesh.count, colour = "Mesh.count")) + geom_point(aes(y = Title.count, colour = "Title.count"))
p3 <- last_plot() + coord_flip()
p3 <- last_plot() + scale_x_discrete(limits=Terms)
p3

plot 3: most frequent title terms with the corresponding mesh terms

Text Mining Analysis


# Code for plot 4: most frequent title terms and most frequent mesh terms
top50.mh.ti <- rbind(xx,xy) # combine top 50 mesh and title terms
Terms <- rownames(top50.mh.ti) # assign rownames to Terms
msh <- top50.mh.ti$mesh_oncolytic_virus.txt
titl <- top50.mh.ti$title_oncolytic_virus.txt
p4 <- ggplot(top50.mh.ti)
p4 <- p4 + geom_text(aes(x = msh, y = titl, label = Terms))
p4

Text Mining Analysis


plot 4: most frequent title terms and most frequent mesh terms

Text Mining Analysis


my.corpus.df <- my.corpus.df[order(my.corpus.df$title_oncolytic_virus.txt, decreasing = TRUE),]
xy <- my.corpus.df[1:50,] # assign the 50 most frequent title terms to xy
xy[1:5,] # view the top 5 most frequent title terms
# sort the 50 most frequent title terms in increasing order (for plot visualization)
xy <- xy[order(xy$title_oncolytic_virus.txt, decreasing = FALSE),]
top50.mh.ti <- rbind(xx,xy) # combine top 50 mesh and title terms
top50.mh.ti$Term <- rownames(top50.mh.ti) # keep the terms in their own column
rownames(top50.mh.ti) <- NULL # reset the row names of the data frame
colnames(top50.mh.ti)[1] <- "msh" # change col name
colnames(top50.mh.ti)[2] <- "title" # change col name
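One detail worth knowing about rbind() on data frames: when a term appears in both top-50 sets, its row name is made unique by appending a digit, so the term shows up twice under slightly different labels. If that is not wanted, one possible alternative is to index the full data frame by the union of the two sets of row names. A sketch on toy data, with all object names hypothetical:

```r
# Toy versions of the full table and the two top lists (hypothetical counts).
all.toy <- data.frame(msh = c(12, 5, 0), title = c(9, 2, 7),
                      row.names = c("virus", "cell", "gene"))
xx.toy <- all.toy[c("virus", "cell"), ]  # top mesh terms
xy.toy <- all.toy[c("virus", "gene"), ]  # top title terms

# rbind keeps the shared term twice and renames the second copy.
rownames(rbind(xx.toy, xy.toy))  # "virus" "cell" "virus1" "gene"

# Alternative: select each term once, using the union of the row names.
keep <- union(rownames(xx.toy), rownames(xy.toy))
top.toy <- all.toy[keep, ]
rownames(top.toy)  # "virus" "cell" "gene"
```

Which behaviour is preferable depends on the plot; the union version gives one label per term.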

# plot 5: most frequent title terms and most frequent mesh terms

Text Mining Analysis

# plot 5: most frequent title terms and most frequent mesh terms
library("reshape2")
# reshape2 provides the melt function, which takes data in wide format and stacks a set of columns into a single column of data
top50.melt <- melt(top50.mh.ti, measure.vars = c("title", "msh"))
top50.melt
# refer to columns by name inside aes() rather than with top50.melt$...
p <- ggplot(top50.melt, aes(Term, value, colour = variable)) + geom_point() + coord_flip()
# each term occurs twice after melting, so pass each term only once as an axis limit
p5 <- last_plot() + scale_x_discrete(limits=unique(top50.melt$Term))
p5

Reference for reshape package: Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.
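To see what melting does to the table, the same wide-to-long reshaping can be written out by hand in base R. This sketch only illustrates the resulting layout on toy data; it is not the reshape2 implementation, and the object names wide and long are hypothetical.

```r
# Wide format: one row per term, one column per count type (hypothetical values).
wide <- data.frame(msh = c(12, 5), title = c(9, 2), Term = c("virus", "cell"),
                   stringsAsFactors = FALSE)

# Long format: one row per term and count type, with the counts stacked in `value`.
long <- data.frame(
  Term     = rep(wide$Term, times = 2),
  variable = rep(c("title", "msh"), each = nrow(wide)),
  value    = c(wide$title, wide$msh),
  stringsAsFactors = FALSE
)
long
```

The long table has twice as many rows as the wide one, which is why each term appears twice on the axis of plot 5.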

Text Mining Analysis


# plot 5: most frequent title terms and most frequent mesh terms

p5 <- ggplot(top50.melt, aes(Term, value, colour = variable)) + geom_point() + coord_flip()
p5

plot 5: most frequent title terms and most frequent mesh terms

Text Mining Analysis
