Text Mining Review Slides
TECHNIQUES, TOOLS,
ONTOLOGIES AND SHARED TASKS
14 Spring
Introduction
Text mining, also called text data mining, is the
process of deriving high-quality information from
text.
Text mining is an interdisciplinary field that draws on
information retrieval, data mining, machine
learning, statistics and computational linguistics.
Text mining techniques have been applied in a large
number of areas, such as business intelligence,
national security, scientific discovery (especially life
science), and social media monitoring.
Introduction
In this set of slides, we cover:
The most commonly used text mining techniques
Ontologies that are often used in text mining
Open source text mining tools
Shared tasks in text mining, which reflect the hot topics in this area
A research case that applies text mining techniques to solve a healthcare-related
problem with social media data
TEXT MINING TECHNIQUES
Text Classification
Sentiment Analysis
Topic Modeling
Named Entity Recognition
Entity Relation Extraction
Text Classification
Text classification, or text categorization, is a problem
in library science, information science, and computer
science. Text classification is the task of choosing the
correct class label for a given input.
Some examples of text classification tasks are:
Deciding whether an email is spam or not (spam
detection).
Deciding which topic a news article belongs to, from a fixed
list of topic areas such as sports, technology, and
politics (document classification).
Deciding whether a given occurrence of the word "bank" is
used to refer to a river bank, a financial institution, the act of
tilting to the side, or the act of depositing something in a
financial institution (word sense disambiguation).
Text Classification
(a) During training, a feature extractor is used to convert each input value to a feature set. These feature sets,
which capture the basic information about each input that should be used to classify it, are discussed in the
next section. Pairs of feature sets and labels are fed into the machine learning algorithm to generate a model.
(b) During prediction, the same feature extractor is used to convert unseen inputs to feature sets. These
feature sets are then fed into the model, which generates predicted labels.
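The training and prediction steps described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a bag-of-words feature extractor and a multinomial naive Bayes model with add-one smoothing; the function names and the four training texts are hypothetical and are not taken from the slides.

```python
import math
from collections import Counter, defaultdict

def extract_features(text):
    """Feature extractor: lowercase bag-of-words counts."""
    return Counter(text.lower().split())

def train(labeled_texts):
    """Build a multinomial naive Bayes model from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_texts:
        feats = extract_features(text)
        label_counts[label] += 1
        word_counts[label].update(feats)
        vocab.update(feats)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Apply the same feature extractor, then pick the most probable label."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    feats = extract_features(text)
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word, count in feats.items():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += count * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set, for illustration only.
TRAIN = [
    ("win cash prize now", "spam"),
    ("cheap prize offer click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train(TRAIN)
print(predict(model, "claim your cash prize"))
```

The same `extract_features` function is used in both phases, which is the point of the diagram: the model only ever sees feature sets, never raw text.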
Text Classification
Common features for text classification include:
bag-of-words (BOW), bigrams, trigrams and part-of-speech (POS) tags for each word in the document.
The most commonly adopted machine learning
algorithms for text classification are naive Bayes,
support vector machines, and maximum
entropy
classifiers.
Algorithm
Language
Tools
Java
Python
C++
Support Vector
MatLab
Machines
Java
Java
Maximum entropy
Python
Naive Bayes
Sentiment Analysis
Sentiment analysis (also known as opinion mining) refers
to the use of natural language processing, text analysis
and computational linguistics to identify and extract
subjective information in source material.
The rise of social media such as forums, microblogs
and blogs has fueled interest in sentiment analysis.
Online reviews, ratings and recommendations on social media sites
have turned into a kind of virtual currency for businesses looking to
market their products, identify new opportunities and manage
their reputations.
As businesses look to automate the process of filtering out the
noise, identifying relevant content and understanding reviewers'
opinions, sentiment analysis is a natural fit.
Sentiment Analysis
Description
Polarity Classification
classifying a given text at the document, sentence, or feature/aspect level into
positive, negative or neutral
Affect
Analysis
Approaches
lexicon based
scoring
machine
learning
classification
lexicon based
scoring
machine
learning
classification
lexicon based
scoring
machine
learning
classification
Named entity
recognition +
entity relation
detection
Named entity
lexicons/
algorithms
SentiWordNet,
LIWC
SVM
WordNet-Affect
SVM
SentiWordNet,
LIWC
SVM
SentiWordNet,
LIWC, WordNet
SVM
SentiWordNet,
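The lexicon-based scoring approach listed above can be sketched as follows. This is a minimal illustration, assuming a tiny hand-made lexicon; the words and weights are hypothetical and are not taken from SentiWordNet or LIWC.

```python
import re

# Hypothetical toy lexicon: word -> polarity weight (not from SentiWordNet/LIWC).
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "terrible": -2.0, "hate": -2.0,
}

def polarity(text, lexicon=LEXICON):
    """Sum the weights of known words and map the total to a polarity label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(lexicon.get(token, 0.0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this great phone"))
```

A machine learning classifier would replace the fixed weights with parameters learned from labeled text, at the cost of needing annotated training data.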
Topic Modeling
Model/Algorithm | Language | Author | Notes
Latent Dirichlet allocation | C | D. Blei | This implements variational inference for LDA.
Supervised topic models for classification | C++ | C. Wang | Implements supervised topic models with a categorical response.
lda: R package for Gibbs sampling in many models | R | J. Chang |
tmve: Topic Model Visualization Engine | Python | A. Chaney |
dtm: Dynamic topic models and the influence model | C++ | S. Gerrish |
ctm-c: Correlated topic models | C | D. Blei |
Mallet: LDA, Hierarchical LDA | Java | A. McCallum |
Stanford topic modeling toolbox: LDA, Labeled LDA, Partially Labeled LDA | Java | Stanford NLP Group | Implements LDA, Labeled LDA, and PLDA
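Several of the tools above fit LDA by Gibbs sampling. As a rough sketch of what such a sampler does, the following pure-Python code implements a collapsed Gibbs sampler for LDA on a toy corpus. It is not the code of any listed tool; the hyperparameters `alpha` and `beta`, the iteration count, and the documents are assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, num_topics, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]            # tokens per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                          # tokens per topic
    assignments = []

    def update(d, v, z, delta):
        doc_topic[d][z] += delta
        topic_word[z][v] += delta
        topic_total[z] += delta

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        assignments.append([])
        for w in doc:
            z = rng.randrange(num_topics)
            assignments[d].append(z)
            update(d, word_id[w], z, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                v = word_id[w]
                update(d, v, assignments[d][n], -1)  # remove the current token
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][n] = z
                update(d, v, z, +1)
    return doc_topic, topic_word, vocab

# Hypothetical toy corpus with two obvious themes.
DOCS = [
    "insulin glucose insulin diabetes".split(),
    "glucose diabetes insulin".split(),
    "parser grammar syntax parser".split(),
    "syntax grammar parser".split(),
]
doc_topic, topic_word, vocab = lda_gibbs(DOCS, num_topics=2)
print(doc_topic)
```

Each row of `doc_topic` counts how many tokens of a document are currently assigned to each topic; normalizing a row (after adding `alpha`) gives that document's estimated topic mixture.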
Type | Sample Categories | Example
People | Individuals, fictional characters |
Organization | Companies, parties |
Location | |
Geo-Political | Countries, states, provinces |
Facility | Bridges, airports |
Approach | Advantage | Disadvantage | Tools/Ontology
Knowledge-based approach | Requires little training data | Creating a lexicon manually is time-consuming and expensive; encoded knowledge might not be portable across domains. | General entity types: WordNet; lexicons created by experts. Medical domain: GATE (University of Sheffield); UMLS (National Library of Medicine)
Machine learning approach: Conditional Random Field (CRF), Hidden Markov Model (HMM) | Reduced human effort in maintaining rules and dictionaries | |
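The knowledge-based approach in the table above amounts to looking text up in a hand-built lexicon. A minimal sketch, assuming a small hypothetical gazetteer keyed by entity type (the entries are illustrative and do not come from WordNet or UMLS):

```python
import re

# Hypothetical gazetteer: entity type -> known names.
GAZETTEER = {
    "Organization": ["Mayo Clinic"],
    "Location": ["Dublin", "Canada"],
    "Facility": ["Golden Gate Bridge"],
}

def tag_entities(text, gazetteer=GAZETTEER):
    """Return (mention, type) pairs for every gazetteer entry found in text."""
    found = []
    for entity_type, names in gazetteer.items():
        for name in names:
            pattern = r"\b" + re.escape(name) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((match.start(), match.group(0), entity_type))
    # Report mentions in the order they appear in the text.
    return [(mention, entity_type) for _, mention, entity_type in sorted(found)]

print(tag_entities("Mayo Clinic staff flew to Dublin."))
```

No training data is needed, but every new domain requires a new gazetteer, which is the portability problem noted in the table.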
Method | Advantage | Disadvantage | Tools
Co-occurrence analysis | Simplicity and flexibility; high recall | Low precision; can't decide relation types |
Rule-based approaches | General, flexible | Lower portability across different domains; manual encoding of syntactic and semantic rules | Syntactic information: Stanford Parser, OpenNLP; semantic information: domain knowledge bases
Supervised learning (feature-based methods: feature representation; kernel-based methods) | Little or no manual development of rules and templates | Annotated corpora are required | Dan Bikel's parser; MST parser; Stanford parser; SVM classifier
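Co-occurrence analysis, the simplest method in the table above, can be sketched as counting how often two entities appear in the same sentence. The entity sets and sentences below are hypothetical; the second sentence shows why precision is low, since a negated mention still counts as a co-occurrence.

```python
from collections import Counter
from itertools import product

def cooccurrence_pairs(sentences, drugs, events):
    """Count sentences in which a drug and an event are both mentioned."""
    counts = Counter()
    for sentence in sentences:
        cleaned = sentence.lower().replace(".", " ").replace(",", " ")
        tokens = set(cleaned.split())
        for drug, event in product(drugs & tokens, events & tokens):
            counts[(drug, event)] += 1
    return counts

# Hypothetical inputs, for illustration only.
SENTENCES = [
    "Lantus gave me a headache.",
    "I take Lantus and Zocor, no headache so far.",
    "Zocor caused bruising.",
]
DRUGS = {"lantus", "zocor"}
EVENTS = {"headache", "bruising"}

pairs = cooccurrence_pairs(SENTENCES, DRUGS, EVENTS)
print(pairs)
```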
[Figure: kernel-based methods. Labels: sentences; text analysis (POS, parse trees); kernel function; classifier; classifier 2: classify the relation types]
Presence of particular
constructions in a constituent
structure
Headwords
Dependency-tree paths
Constituent-tree paths
Tree distance between the arguments
Dependency tree
kernel
Word, POS,
Generalized POS,
Chunk tag, Entity
Type, Entity level
Word, POS,
Generalized POS,
Entity Type
Ontology
Creator
Princeton University
Description
Application
Word sense
disambiguation
Text summarization
Text similarity analysis
Sentiment analysis
Sentiment analysis
Affect analysis
Deception detection
WordNet
WordNet is an online lexical database in
which English nouns, verbs, adjectives and
adverbs are organized into sets of
synonyms.
Each word represents a lexicalized concept.
Semantic relations link the synonym sets
(synsets).
WordNet
Six semantic relations are presented in WordNet because they apply broadly
throughout English and because a user need not have advanced training in
linguistics to understand them. The table below shows the included
semantic relations.
Semantic Relation | Syntactic Category | Examples
Synonymy (similar) | Noun, Verb, Adjective, Adverb | Pipe, tube; Rise, ascent; Sad, unhappy; Rapidly, speedily
Antonymy (opposite) | Adjective, Adverb | Wet, dry; Powerful, powerless; Rapidly, slowly
Hyponymy (subordinate) | Noun | Maple, tree; Tree, plant
Meronymy (part) | Noun | Brim, hat; Ship, fleet
Troponymy (manner) | Verb | March, walk; Whisper, speak
Entailment | Verb | Drive, ride; Divorce, marry
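Relations such as hyponymy form chains that applications can traverse. The sketch below uses a toy in-memory mapping built only from the hyponymy examples in the table (maple, tree, plant); it is not the WordNet API, and the function names are hypothetical.

```python
# Toy hypernym map built from the table's hyponymy examples:
# each word points to its more general (superordinate) term.
HYPERNYM = {"maple": "tree", "tree": "plant"}

def hypernym_chain(word, hypernym=HYPERNYM):
    """Follow hyponymy links upward and return the list of ancestors."""
    chain = []
    while word in hypernym:
        word = hypernym[word]
        chain.append(word)
    return chain

def is_a(word, ancestor, hypernym=HYPERNYM):
    """True if `ancestor` is reachable from `word` through hyponymy links."""
    return ancestor in hypernym_chain(word, hypernym)

print(hypernym_chain("maple"))
```

Text similarity measures built on WordNet typically compare such chains, for example by how far up two words must climb before they meet.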
SentiWordNet
In SentiWordNet, different senses of the same
term may have different opinion-related
properties.
[Figure labels: search term; Sense 1, Sense 2, Sense 3; positivity, objectivity and negativity score; synonym of estimable in this sense]
The figure above visualizes the opinion-related properties of the term "estimable"
in SentiWordNet (http://sentiwordnet.isti.cnr.it/search.php?q=estimable).
LIWC results
from personal
text and formal
writing for
comparison
Name
Creator
McKusick-Nathans Institute
of Genetic Medicine
Johns Hopkins University
Annotate human
genes
Identify clinical terms
Description
All of the organisms in public sequence
database
College of American
Systematized Nomenclature
Pathologists
of Medicine--Clinical Te
Application
Identify
organisms
Identify terms in
anatomy
Gene product
annotation
Cover terms in
biomedical
literature
MetaMap (Java)
Extracts UMLS concepts from text
Accepts input text of variable length
Outputs a ranked list of UMLS concepts associated with the
input text
MedEffect
MedEffect is the Canada Vigilance Adverse Reaction
Online Database, which contains information about
suspected adverse reactions to health products.
Reports are submitted by consumers and health professionals
Contains a complete list of medications, adverse
reactions and drug indications (medical conditions for which the
medication is legitimately used)
Consumer Health
Vocabulary (CHV)
Consumer Health Vocabulary (CHV) is a lexicon linking
UMLS standard medical terms to health consumer
vocabulary.
Laypeople use a different vocabulary than healthcare professionals
to describe medical problems.
CHV helps to bridge the communication gap between consumers
and healthcare professionals by mapping the UMLS standard
medical terms to consumer health language.
Name
Antelope
framework
Apertium
ClearTK
cTakes
DELPH-IN
Factorie
FreeLing
General
Architecture
for Text
Engineering
(GATE)
Graph
Expression
Main Features
Part-of-speech tagging, dependency parsing, WordNet
lexicon
Machine translation for language pairs from Spanish,
English, French, Portuguese, Catalan and Occitan
Wrappers for machine learning libraries(SVMlight,
LibSVM, OpenNLP MaxEnt) and NLP tools (Snowball
Stemmer, OpenNLP, Stanford CoreNLP)
Language
C#, VB.net
Proxem
C++, Java
(various)
The Center for Computational Language and Education Research at the University of Colorado Boulder
Java
Creators
LISP, C++
Java
C++
Children's Hospital Boston, Mayo Clinic
Website
[1]
[2]
[3]
[4]
Deep Linguistic Processing with HPSG Initiative
[5]
University of Massachusetts Amherst
[6]
Universitat Politècnica de Catalunya
[7]
Java
GATE open
source
community
[8]
Java
Startup huti.ru
[9]
Name
Main Features
Language
Creators
Website
Java
Cognitive
Computation
Group at UIUC
[10]
LingPipe
Java
Alias-i
[11]
Mahout
Java
Online
community
[12]
Mallet
Java
University of
Massachusetts
Amherst
[13]
Java
Java
MontyLingua
Python,
Java
MIT
[16]
Natural Language Toolkit (NLTK)
Python
Online community
[17]
NooJ (based on INTEX)
.NET Framework-based
University of Franche-Comté, France
[18]
MetaMap
National Library
of Medicine
UCLA Medical
Imaging
Informatics (MII)
Group
[14]
[15]
Name
Main Features
OpenNLP
Pattern
PSI-Toolkit
Adam Mickiewicz University in Poznań
[21]
Scala
[22]
Java
The Stanford
Natural Language
Processing Group
[23]
C++
University of
Cambridge,
University of
Sussex
[24]
[25]
[26]
Stanford NLP
Treex
Website
[20]
Text
Engineering
Software
Laboratory
(Tesla)
Creators
Tom De Smedt, CLiPS, University of Antwerp
ScalaNLP
Natural
Java
Rasp
Language
Java
University of
Cologne
Machine translation
Perl
Charles University
[27]
in Prague
Language
Name
Main Features
UIMA
Java/C++
Apache
[28]
NLP++ / compiles to C++
[29]
VisualText
Creators
Text Analysis
International, Inc
Java/C++
OW2
Website
[30]
UniteX
Java & C++
Laboratoire
d'Automatique
Documentaire et
Linguistique
[31]
The Dragon
Toolkit
Java
Drexel University
[32]
Text Extraction,
Annotation and
Retrieval
Toolkit
Ruby
Louis Mullie
[33]
Zhihuita NLP
API
Zhihuita.org
[34]
Introduction
Shared task series in Natural Language Processing often reflect
community-wide trends and hot topics that have not been fully explored in the
past.
To keep up with the state-of-the-art techniques and new research topics in
the NLP community, we explored the major conferences, workshops, and special
interest groups belonging to the Association for Computational Linguistics
(ACL).
We organize our findings into two categories: ongoing shared tasks and a
watch list.
The ongoing list contains competitions that have already made task descriptions, data and
schedules for 2014 publicly available.
International Workshop on Semantic Evaluation (SemEval)
CLEF eHealth Evaluation Lab
The watch list contains competitions that haven't made content available yet but are relevant
to our interests.
SemEval
Overview
SemEval, the International Workshop on Semantic Evaluation,
is an ongoing series of evaluations of computational
semantic analysis systems. It evolved from the SensEval
(word sense evaluation) series.
SIGLEX, the Special Interest Group on the Lexicon of the
Association for Computational Linguistics, is the umbrella
organization for SemEval.
SemEval-2014 will be the 8th workshop on semantic
evaluation. The workshop will be co-located with the 25th
International Conference on Computational Linguistics
(COLING) in Dublin, Ireland.
SemEval
Past workshops
Workshop
No. of
Tasks
Areas of study
Senseval1(1998)
Senseval2(2001)
12
Senseval3(2004)
16
Logic Form Transformation, Machine Translation (MT) Evaluation, Basque, Catalan, Chinese, English,
Semantic Role Labeling, WSD
Italian, Romanian, Spanish
19
SemEval-2010
18
SemEval-2012
Chinese, English
14
SemEval-2007
SemEval-2013
Task ID
SemEval-2014
Task Name
Evaluation of
compositional
distributional
semantic models
(CDSMs) on full
sentences
Description
Subtask A: predicting the degree of
relatedness between two sentences
Subtask B: detecting the entailment
relation holding between them
L2 writing
assistant
1: Aspect term extraction
2: Aspect term polarity
3: Aspect category detection
4: Aspect category polarity
Data
10,000 English sentence pairs, each
annotated for relatedness score in meaning
and the entailment relation (entail,
contradiction, and neutral) between the two
sentences.
Training data will cover two domains: air
travel and tourism.
The data will be available in two languages:
Greek and English.
Task
ID
SemEval-2014
Task Name
Description
Data
In trial data, each natural language
command is annotated into robot
command.
"Move the blue block on top of the grey
block." is labeled as
(event: (action: move) (entity: (color: blue)
(type: cube)) (destination: (spatial-relation:
(relation: above) (entity: (color: gray) (type:
cube)))))
Spatial Robot
Commands
Analysis of
Clinical Text
Broad-Coverage
and CrossFramework
Semantic
Dependency
Parsing
Sentiment
Analysis for
Twitter
SemEval-2014
Important Dates
Overview
Task ID
Task
Visual-Interactive Search and Exploration of eHealth Data
User-centered health information retrieval
Description
Data
Subtask A: monolingual information retrieval task: retrieve the relevant medical documents for the
user queries
Subtask B: multilingual information retrieval task: German, French and Czech.
CoNLL
Overview
CoNLL, the Conference on Natural Language Learning, is a yearly
meeting of the Special Interest Group on Natural Language Learning
(SIGNLL) of the Association for Computational Linguistics (held
since 1997).
Since 1999, CoNLL has included a shared task in which training and
test data are provided by the organizers, which allows participating
systems to be evaluated and compared in a systematic way.
Descriptions of the systems and evaluations of their performance
are presented both at the conference and in the proceedings.
The last CoNLL was held in August 2013 in Sofia, Bulgaria.
Information about CoNLL 2014 and its shared task will be released
next month.
CoNLL
Recent shared tasks from CoNLL
Year
Task
Data
National University of Singapore
Corpus of Learner English
(NUCLE)
Language
English
Arabic,
Chinese,
English
English
English
English,
Catalan,
Chinese,
Czech,
German,
Japanese and Spanish
*SEM
Overview
The Joint Conference on Lexical and Computational Semantics
(*SEM), which started in 2012, is organized by the Association for
Computational Linguistics (ACL) Special Interest Group on the
Lexicon (SIGLEX) and Special Interest Group on Computational
Semantics (SIGSEM).
The main goal of *SEM is to provide a stable forum for
researchers working on different aspects of semantic processing.
Every *SEM conference includes a shared task in which training
and test data are provided by the organizers, allowing
participating systems to be evaluated and compared in a
systematic way. *SEM 2014 will release information about its
shared task in Dec. or early Jan. 2014.
*SEM
*SEM 2012 shared task:
Description: Resolving the scope and the focus of
negation
Data: Stories by Conan Doyle, and WSJ PropBank Data
(about 8,000 sentences in total). All occurrences of
negation, their scope and focus are annotated.
BioNLP
Overview
BioNLP shared tasks are organized by the ACL's
Special Interest Group for biomedical natural
language processing.
BioNLP 2013 was the twelfth workshop on
biomedical natural language processing and was held in
conjunction with the annual ACL or NAACL meeting.
BioNLP shared tasks are biennial events held with
the BioNLP workshop, starting in 2009. The next
event will be held in 2015.
Task
Data
Released
Date
End Date
Oct. 2012
Apr. 2013
PubMed Literature
PubMed abstracts
PubMed Literature
Webpage documents
with general
information about
bacteria species
PubMed Abstracts
1. GENIA
PubMed abstracts
Dec.
2010
Apr. 2011
PubMed abstracts
PubMed abstracts
PubMed abstracts
PubMed abstracts
PubMed abstracts
PubMed abstracts
PubMed abstracts
Dec. 15
2008
Mar. 30
2009
5. Bacteria Biotopes
PubMed abstracts
PubMed abstracts
i2b2 Challenges
Informatics for Integrating Biology and the Bedside
(i2b2) is an NIH-funded National Center for Biomedical
Computing (NCBC).
The i2b2 center organizes data challenges to motivate the
development of scalable computational frameworks to
address the bottleneck limiting the translation of
genomic findings and hypotheses in model systems
relevant to human health.
i2b2 challenge workshops are held in conjunction with the
Annual Meeting of the American Medical Informatics
Association.
Year | Task | Data | Release Date | End Date
2012 | | EHR | Jun. 2012 | Sept. 2012
2011 | Co-reference resolution | EHR | Jun. 2011 | Sept. 2011
2010 | | Discharge summaries | Apr. 2010 | Sept. 2010
2009 | Medication extraction | Narrative patient records | Jun. 2009 | Sept. 2009
2008 | | Discharge summaries | Mar. 2008 | Sept. 2008
2006 | | De-identified discharge summaries; discharge summaries | Jun. 2006 | Sept. 2006
These forums cover a large and diverse population and contain data
directly from patients.
Patient forum adverse drug event (ADE) reports can serve as an economical alternative to
expensive and time-consuming patient-oriented drug safety data
collection projects.
They can help to generate new clinical hypotheses, cross-validate
the adverse drug events detected from other data sources, and
conduct comparison studies.
Post ID | Post Content | Contain ADE? | Report source
9043
12200
From what you have said, it seems that Lantus [Treatment] has had some negative side
ADE
ADE
Patient
Hearsay
34188
When taking Zocor [Treatment], I had headaches [Event] and bruising [Event].
63828
Another study of people with multiple risk factors for stroke [Event] found that Lipitor
[Treatment] reduced the risk of stroke [Event] by 26% compared to those taking a
placebo, the company said.
Drug Indication
Negated
ADE
ADE
Patient
Patient
Diabetes
research
Test Bed
Discussion about disease monitoring and medical products
Discussion about disease and medical problems
Forum Name | Number of Posts | Number of Topics | Number of Member Profiles | Time Span | Total Number of Sentences
American Diabetes Association | 184,874 | 26,084 | 6,544 | 2009.2-2012.11 | 1,348,364
Diabetes Forums | 568,684 | 45,830 | 12,075 | 2002.2-2012.11 | 3,303,804
Diabetes Forum | 67,444 | 6,474 | 3,007 | 2007.2-2012.11 | 422,355
Solutions
Develop relation extractor for recognizing and extracting
adverse drug event relations.
Develop a text classifier to extract adverse drug event reports
based on patient experience.
Patient forum data collection: collect patient forum data through a web crawler
Data preprocessing: remove noisy text, including URLs and duplicated punctuation, and
split each post into individual sentences
Medical entity extraction: identify treatments and adverse events discussed in the
forum
Adverse drug event extraction: identify drug-event pairs indicating an adverse
drug event, based on the results of medical entity extraction
Report source classification: classify whether each reported event comes from
patient experience or hearsay
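The data preprocessing step above can be sketched with a few regular expressions. This is only an illustration of the kind of cleanup described (URL removal, collapsing duplicated punctuation, sentence splitting); the patterns and the sample post are assumptions, not the actual rules used in the study.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
REPEAT_PUNCT_RE = re.compile(r"([!?.,;:])\1+")   # e.g. "!!!" or "..."
SENTENCE_END_RE = re.compile(r"(?<=[.!?])\s+")   # split after ., ! or ?

def preprocess(post):
    """Remove URLs, collapse repeated punctuation, and split into sentences."""
    text = URL_RE.sub(" ", post)
    text = REPEAT_PUNCT_RE.sub(r"\1", text)
    text = re.sub(r"\s+", " ", text).strip()
    return [s for s in SENTENCE_END_RE.split(text) if s]

# Hypothetical forum post, for illustration only.
POST = ("Lantus made me dizzy!!! See http://example.com for more... "
        "Has anyone else had this??")
sentences = preprocess(POST)
print(sentences)
```

A production pipeline would likely use a trained sentence splitter, since forum text often lacks reliable punctuation.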
Semantic filtering
Drug indications from FAERS
Incorporate medical domain knowledge
for differentiating drug indication from
adverse events
NegEx
Incorporate linguistic knowledge to
identify negated adverse drug events.
Semantic templates
Form filtering templates using the
knowledge from FAERS and NegEx.
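The two filters above can be illustrated with a simplified sketch. It is not the NegEx algorithm and does not read FAERS: the trigger list, the fixed three-token window, and the single indication pair are hypothetical stand-ins chosen to show the idea of filtering candidate drug-event pairs.

```python
# Hypothetical stand-ins for the real knowledge sources.
NEGATION_TRIGGERS = {"no", "not", "never", "without"}   # not the NegEx trigger list
KNOWN_INDICATIONS = {("lipitor", "stroke")}              # stand-in for FAERS indications

def classify_pair(tokens, drug, event, window=3):
    """Label a candidate drug-event pair as indication, negated, or ade."""
    if (drug, event) in KNOWN_INDICATIONS:
        return "indication"   # the drug treats or prevents this condition
    position = tokens.index(event)
    preceding = tokens[max(0, position - window):position]
    if any(token in NEGATION_TRIGGERS for token in preceding):
        return "negated"      # a negation trigger appears shortly before the event
    return "ade"

print(classify_pair("i had no headaches on zocor".split(), "zocor", "headaches"))
```

Only pairs labeled `ade` would be kept as adverse drug event candidates in this sketch.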
The figure above shows the dependency tree of a sentence. In this sentence,
hypoglycemia is an adverse event and Lantus is a diabetes treatment.
Grammatical relations between words are illustrated in the figure. For instance,
"cause" and "hypoglycemia" have the relation dobj, as "hypoglycemia" is the direct
object of "cause". In this relation, "cause" is the governor and "hypoglycemia" is the
dependent.
To reduce data sparsity and increase the robustness of our method,
we expand the shortest dependency path by categorizing words on the
path into syntactic and semantic classes with varying degrees of
generality.
Word classes include part-of-speech (POS) tags and generalized POS tags.
POS tags are extracted with the Stanford CoreNLP package. We generalized
the POS tags following the Penn Treebank guidelines. Semantic
types (Event and Treatment) are also used for the two ends of the shortest
path.
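The path expansion described above can be sketched as follows. The dependency edges are a hypothetical parse of a sentence like "Lantus can cause hypoglycemia", typed in by hand rather than produced by Stanford CoreNLP, and the `GENERAL_POS` mapping is an assumed simplification of the Penn Treebank tag groups. Edge directions are ignored when searching for the path.

```python
from collections import deque

# Hypothetical parse: (governor, relation, dependent) triples.
EDGES = [
    ("cause", "nsubj", "Lantus"),
    ("cause", "aux", "can"),
    ("cause", "dobj", "hypoglycemia"),
]
POS_TAGS = {"Lantus": "NNP", "can": "MD", "cause": "VB", "hypoglycemia": "NN"}
# Assumed generalization: first two letters of the Penn tag pick the coarse class.
GENERAL_POS = {"NN": "Noun", "VB": "Verb", "JJ": "Adjective", "RB": "Adverb"}

def shortest_path(edges, start, goal):
    """Breadth-first search over the dependency tree, treated as undirected."""
    graph = {}
    for governor, _, dependent in edges:
        graph.setdefault(governor, []).append(dependent)
        graph.setdefault(dependent, []).append(governor)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None

def word_classes(path, pos_tags, end_types):
    """Expand each word on the path into a set of word classes."""
    classes = []
    for i, word in enumerate(path):
        tag = pos_tags[word]
        features = {word, tag}
        general = GENERAL_POS.get(tag[:2])
        if general:
            features.add(general)
        if i == 0:
            features.add(end_types[0])       # semantic type of the first argument
        if i == len(path) - 1:
            features.add(end_types[1])       # semantic type of the second argument
        classes.append(features)
    return classes

path = shortest_path(EDGES, "Lantus", "hypoglycemia")
classes = word_classes(path, POS_TAGS, ("Treatment", "Event"))
print(path)
print(classes)
```

Because each position holds several classes of varying generality, two paths with different words can still match on their POS tags or semantic types.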
$C(x_i, y_i) = |x_i \cap y_i|$
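The term C(x_i, y_i) counts the word classes shared by the two paths at position i. A sketch of how it could be combined into a path kernel is given below, assuming the shortest path dependency kernel of Bunescu and Mooney (2005) from the reference list: the product of the per-position counts when the paths have equal length, and zero otherwise. Whether the slides use exactly this combination is an assumption, and the two example paths are hypothetical.

```python
def common_classes(x_i, y_i):
    """C(x_i, y_i): number of word classes shared at one path position."""
    return len(x_i & y_i)

def path_kernel(x, y):
    """Product of per-position overlaps; zero if the paths differ in length."""
    if len(x) != len(y):
        return 0
    result = 1
    for x_i, y_i in zip(x, y):
        result *= common_classes(x_i, y_i)
    return result

# Hypothetical expanded paths (one set of word classes per position).
PATH_X = [
    {"Lantus", "NNP", "Noun", "Treatment"},
    {"cause", "VB", "Verb"},
    {"hypoglycemia", "NN", "Noun", "Event"},
]
PATH_Y = [
    {"Zocor", "NNP", "Noun", "Treatment"},
    {"gave", "VBD", "Verb"},
    {"headaches", "NNS", "Noun", "Event"},
]
print(path_kernel(PATH_X, PATH_Y))
```

Here the positions share 3, 1 and 2 classes respectively, so the kernel value is 6; a single position with no shared class would drive the whole product to zero.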
[Figure: evaluation results chart. Legible labels: recall, f-measure]
[Figure: recall and F-measure of CO, SL and SL+SF on the American Diabetes Association, DiabetesForums and Diabetes Forum test beds]
[Figure: counts and percentages by forum. Legible labels: Diabetes Forums, Diabetes Forum]
References
*SEM: http://ixa2.si.ehu.es/starsem/
CoNLL: http://ifarm.nl/signll/conll/
SemEval: http://alt.qcri.org/semeval2014/
CLEF eHealth: http://clefehealth2014.dcu.ie/home
BioNLP: http://2013.bionlp-st.org/
i2b2: https://www.i2b2.org/
Benton, A., Ungar, L., Hill, S., Hennessy, S., Mao, J., Chung, A., & Holmes, J. H. (2011). Identifying potential adverse effects using the web: A new approach to medical hypothesis generation. Journal of Biomedical Informatics, 44(6), pp. 989-996.
Bian, J., Topaloglu, U., & Yu, F. (2012). Towards large-scale Twitter mining for drug-related adverse events. In: Proceedings of the 2012 ACM International Workshop on Smart Health and Wellbeing, pp. 25-32.
Bunescu, R. C., & Mooney, R. J. (2005). A shortest path dependency kernel for relation extraction. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 724-731.
Chee, B. W., Berlin, R., & Schatz, B. (2011). Predicting adverse drug events from personal health messages. In: AMIA Annual Symposium Proceedings, Vol. 2011, pp. 217-226.
Culotta, A., & Sorensen, J. (2004). Dependency tree kernels for relation extraction. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pp. 423-429.
Leaman, R., Wojtulewicz, L., Sullivan, R., Skariah, A., Yang, J., & Gonzalez, G. (2010). Towards Internet-age pharmacovigilance: Extracting adverse drug reactions from user posts to health-related social networks. In: Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, ACL, pp. 117-125.
Liu, X., & Chen, H. (2013). AZDrugMiner: An information extraction system for mining patient-reported adverse drug events in online patient forums. In: Smart Health. Springer Berlin Heidelberg, pp. 134-150.
Yang, C. C., Yang, H., Jiang, L., & Zhang, M. (2012). Social media mining for drug safety signal detection. In: Proceedings of the 2012 International Workshop on Smart Health and Wellbeing, ACM, pp. 33-40.
Zelenko, D., Aone, C., & Richardella, A. (2003). Kernel methods for relation extraction. Journal of Machine Learning Research, 3, pp. 1083-1106.