
TEXT MINING:
TECHNIQUES, TOOLS, ONTOLOGIES AND SHARED TASKS
Spring 2014

Introduction
Text mining, also referred to as text data mining, is the process of deriving high-quality information from text.
Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics and computational linguistics.
Text mining techniques have been applied in a large number of areas, such as business intelligence, national security, scientific discovery (especially the life sciences) and social media monitoring.

Introduction
In this set of slides, we cover:
The most commonly used text mining techniques
Ontologies that are often used in text mining
Open source text mining tools
Shared tasks in text mining, which reflect the hot topics in this area
A research case that applies text mining techniques to solve a healthcare-related problem with social media data.

TEXT MINING TECHNIQUES
Text Classification
Sentiment Analysis
Topic Modeling
Named Entity Recognition
Entity Relation Extraction

Text Classification
Text classification, or text categorization, is a problem in library science, information science, and computer science. Text classification is the task of choosing the correct class label for a given input.
Some examples of text classification tasks are:
Deciding whether an email is spam or not (spam detection).
Deciding whether the topic of a news article is from a fixed list of topic areas such as sports, technology, and politics (document classification).
Deciding whether a given occurrence of the word "bank" refers to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution (word sense disambiguation).

Text Classification
Text classification is a supervised machine learning task, as it is built on training corpora containing the correct label for each input. The framework for classification is shown in the figure below.
(a) During training, a feature extractor is used to convert each input value to a feature set. These feature sets, which capture the basic information about each input that should be used to classify it, are discussed in the next section. Pairs of feature sets and labels are fed into the machine learning algorithm to generate a model.
(b) During prediction, the same feature extractor is used to convert unseen inputs to feature sets. These feature sets are then fed into the model, which generates predicted labels.

Text Classification
Common features for text classification include: bag-of-words (BOW), bigrams, trigrams and part-of-speech (POS) tags for each word in the document.
The most commonly adopted machine learning algorithms for text classification are naive Bayes, support vector machines, and maximum entropy classifiers.

Algorithm | Language | Tools
Naive Bayes | Java | Weka, Mahout, Mallet
Naive Bayes | Python | NLTK
Support Vector Machines | C++ | SVM-light, mySVM, LibSVM
Support Vector Machines | MatLab | SVM Toolbox
Support Vector Machines | Java | Weka
Maximum entropy | Java | Mallet
Maximum entropy | Python | NLTK
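As a minimal sketch of the training/prediction framework above, the following Python snippet trains a naive Bayes classifier with NLTK on a tiny hand-made spam-detection dataset; the example data and the bag-of-words feature extractor are illustrative assumptions, not material from the original slides.

import nltk

def bow_features(text):
    # Bag-of-words feature set: each lowercased token maps to True
    return {token.lower(): True for token in text.split()}

train = [
    ("win a free prize now", "spam"),
    ("cheap meds available online", "spam"),
    ("meeting rescheduled to monday", "ham"),
    ("please review the attached report", "ham"),
]

train_set = [(bow_features(text), label) for text, label in train]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Prediction: the same feature extractor is applied to unseen input
print(classifier.classify(bow_features("free prize waiting for you")))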

Sentiment Analysis
Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information from source material.
The rise of social media such as forums, microblogging and blogs has fueled interest in sentiment analysis.
Online reviews, ratings and recommendations on social media sites have turned into a kind of virtual currency for businesses looking to market their products, identify new opportunities and manage their reputations.
As businesses look to automate the process of filtering out the noise, identifying relevant content and understanding reviewers' opinions, sentiment analysis is the right technique.

Sentiment Analysis
The main tasks, their descriptions and approaches are summarized in the table below:

Task | Description | Approaches | Lexicons/Algorithms
Polarity Classification | Classifying a given text at the document, sentence, or feature/aspect level into positive, negative or neutral | Lexicon-based scoring; machine learning classification | SentiWordNet, LIWC; SVM
Affect Analysis | Classifying a given text into affect states such as "angry", "sad", and "happy" | Lexicon-based scoring; machine learning classification | WordNet-Affect; SVM
Subjectivity Analysis | Classifying a given text into two classes: objective and subjective | Lexicon-based scoring; machine learning classification | SentiWordNet, LIWC; SVM
Feature/Aspect Based Analysis | Determining the opinions or sentiment expressed on different features or aspects of entities (e.g., the screen [feature] of a cell phone [entity]) | Named entity recognition + entity relation detection | SentiWordNet, LIWC, WordNet; SVM
Opinion Holder Detection | Detecting the holder of a sentiment (i.e. ...) | Named entity recognition + ... | SentiWordNet, ...
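To make the lexicon-based scoring approach above concrete, here is a minimal sketch using the SentiWordNet interface bundled with NLTK; scoring only the first SentiWordNet sense of each token is a simplifying assumption for illustration, not the method prescribed in these slides.

import nltk
from nltk.corpus import sentiwordnet as swn

# nltk.download('sentiwordnet'); nltk.download('wordnet')  # one-time setup

def polarity_score(text):
    # Sum (positive - negative) scores of the first sense of each token
    score = 0.0
    for token in text.lower().split():
        senses = list(swn.senti_synsets(token))
        if senses:
            score += senses[0].pos_score() - senses[0].neg_score()
    return score

print(polarity_score("the screen is great but the battery is terrible"))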

Topic Modeling
Topic models are a suite of algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents.
Topic modeling algorithms include Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Indexing (PLSI), and Latent Dirichlet Allocation (LDA).
Among them, Latent Dirichlet Allocation (LDA) is the most commonly used nowadays.
Topic modeling algorithms can be applied to massive collections of documents.
Recent advances in this field allow us to analyze streaming collections, like you might find from a Web API.
Topic modeling algorithms can be adapted to many kinds of data.
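As an illustration of fitting LDA in practice, the sketch below uses the gensim library on a toy corpus; gensim is an assumed choice here and is not among the tools listed on the following slides.

from gensim import corpora, models

documents = [
    "insulin dose and blood sugar control",
    "blood sugar spikes after meals",
    "topic models discover themes in document collections",
    "latent dirichlet allocation assigns topics to words",
]
texts = [doc.lower().split() for doc in documents]

dictionary = corpora.Dictionary(texts)                   # map words to integer ids
corpus = [dictionary.doc2bow(text) for text in texts]    # bag-of-words vectors

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics(num_topics=2, num_words=5):
    print(topic_id, words)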

Topic Modeling - LDA
The figure below shows the intuitions behind latent Dirichlet allocation. We assume that some number of topics, which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows: first choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic.

Topic Modeling - LDA
The figure below shows real inference with LDA. A 100-topic LDA model is fit to 17,000 articles from the journal Science. At left are the inferred topic proportions for the example article in the previous figure. At right are the top 15 most frequent words from the most frequent topics found in this article.

Topic Modeling - Tools

Name | Model/Algorithm | Language | Author | Notes
lda-c | Latent Dirichlet allocation | C | D. Blei | Implements variational inference for LDA.
class-slda | Supervised topic models for classification | C++ | C. Wang | Implements supervised topic models with a categorical response.
lda | R package for Gibbs sampling in many models | R | J. Chang | Implements many models and is fast. Supports LDA, RTMs (for networked documents), MMSB (for network data), and sLDA (with a continuous response).
tmve | Topic Model Visualization Engine | Python | A. Chaney | A package for creating corpus browsers.
dtm | Dynamic topic models and the influence model | C++ | S. Gerrish | Implements topics that change over time and a model of how individual documents predict that change.
ctm-c | Correlated topic models | C | D. Blei | Implements variational inference for the CTM.
Mallet | LDA, Hierarchical LDA | Java | A. McCallum | Implements LDA and Hierarchical LDA.
Stanford topic modeling toolbox | LDA, Labeled LDA, Partially Labeled LDA | Java | Stanford NLP Group | Implements LDA, Labeled LDA, and PLDA.

Named Entity Recognition
A named entity is anything that can be referred to with a proper name.
Named entity recognition aims to:
Find spans of text that constitute proper names
Classify the entities being referred to according to their type

Type | Sample Categories | Example
People | Individuals, fictional characters | Turing is often considered to be the father of modern computer science.
Organization | Companies, parties | Amazon plans to use drone copters for deliveries.
Location | Mountains, lakes, seas | The highest point in the Catalinas is Mount Lemmon, at an elevation of 9,157 feet above sea level.
Geo-Political | Countries, states, provinces | The Catalinas are located north and northeast of Tucson, Arizona, United States.
Facility | Bridges, airports | In the late 1940s, Chicago Midway was the busiest airport in the United States by total aircraft operations.
Vehicles | Planes, trains, cars | The updated Mini Cooper retains its charm and agility.

In practice, named entity recognition can be extended to types that are not in the table above, such as temporal expressions (times and dates), genes, proteins, and medical concepts (diseases, treatments and medical events).
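As a minimal illustration of finding proper-name spans and typing them, the sketch below uses NLTK's built-in named entity chunker; this is a generic example and not one of the specific tools discussed on the next slide.

import nltk

# one-time setup:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
# nltk.download('maxent_ne_chunker'); nltk.download('words')

sentence = "Amazon plans to use drone copters for deliveries in Tucson, Arizona."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)  # subtrees labeled PERSON, ORGANIZATION, GPE, ...

for subtree in tree.subtrees():
    if subtree.label() != 'S':
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))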

Named Entity Recognition
Named entity recognition techniques can be categorized into knowledge-based approaches and machine learning based approaches.

Category | Advantage | Disadvantage | Tools / Ontologies
Knowledge-based approach | Requires little training data | Creating lexicons manually is time-consuming and expensive; encoded knowledge may not be portable across domains | General entity types: WordNet, lexicons created by experts, GATE (University of Sheffield). Medical domain: UMLS (National Library of Medicine), MedLEE (originally from Columbia University, now commercialized)
Machine learning approach: Conditional Random Fields (CRF), Hidden Markov Models (HMM) | Reduced human effort in maintaining rules and dictionaries | A set of annotated training data must be prepared | CRF tools: Stanford NER, CRF++, Mallet. HMM tools: Mallet, Natural Language Toolkit (NLTK)

Entity Relation Extraction
Entity relation extraction discerns the relationships that exist among the entities detected in a text. Entity relation extraction techniques are applied in a variety of areas:
Question answering
Extracting entities and relational patterns for answering factoid questions
Feature/aspect based sentiment analysis
Extracting relational patterns among entities, features and sentiments in text: R(entity, feature, sentiment)
Mining biomedical texts
Protein binding relations useful for drug discovery
Detection of gene-disease relations from the biomedical literature
Finding drug-side effect relations in health social media

Entity Relation Extraction
Entity relation extraction approaches can be categorized into three types (a small co-occurrence sketch follows the table):

Category | Method | Advantage | Disadvantage | Tools
Co-occurrence Analysis | If two entities co-occur within a certain distance, they are considered to have a relation | Simplicity and flexibility; high recall | Low precision; cannot decide relation types | (not listed)
Rule-based approaches | Create rules for relation extraction based on syntactic and semantic information in the sentences | General, flexible | Lower portability across different domains; manual encoding of syntactic and semantic rules | Syntactic information: Stanford Parser, OpenNLP. Semantic information: domain knowledge bases
Supervised Learning | Feature-based methods: feature representation. Kernel-based methods: kernel functions | Little or no manual development of rules and templates | Annotated corpora are required | Dan Bikel's parser, MST parser, Stanford parser, SVM classifiers
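The sketch below illustrates the co-occurrence analysis row above; the window size and the toy drug/event lexicons are assumptions made only for this example.

DRUGS = {"actos", "lantus", "zocor"}                # toy drug lexicon (assumption)
EVENTS = {"headache", "fatigue", "hypoglycemia"}    # toy event lexicon (assumption)
WINDOW = 5  # maximum token distance for a co-occurrence relation

def cooccurrence_relations(sentence):
    tokens = sentence.lower().split()
    drugs = [(i, t) for i, t in enumerate(tokens) if t in DRUGS]
    events = [(i, t) for i, t in enumerate(tokens) if t in EVENTS]
    # Emit a relation for every drug-event pair within the window
    return [(d, e) for i, d in drugs for j, e in events if abs(i - j) <= WINDOW]

print(cooccurrence_relations("when taking zocor i had fatigue and hypoglycemia"))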

Supervised Learning Approaches for Entity Relation Extraction
The supervised learning approach breaks relation extraction into two subtasks (relation detection and relation classification). Each subtask is a text classification problem.
Classifier 1: detect when a relation is present between two entities
Classifier 2: classify the relation types
Supervised learning approaches can be categorized into feature-based methods and kernel-based methods.
[Figure: in feature-based methods, sentences pass through text analysis (POS tagging, parse trees) and feature extraction before the classifier; in kernel-based methods, the text analysis output is fed to a kernel function used by the classifier.]

Supervised Learning Approach to Entity Relation Extraction
Feature-based methods rely on features to represent instances for classification. The features for relation extraction can be categorized into entity-based, word-based and syntactic features (a small sketch of assembling such features follows this list).

Entity-based features:
Entity types of the two candidate arguments
Concatenation of the two entity types
Headwords of the arguments
Bag-of-words from the arguments
Number of entities between the arguments

Word-based features:
Bag-of-words and bag-of-bigrams between the entities
Stemmed versions of the bag-of-words and bag-of-bigrams between the entities
Words and stems immediately preceding and following the entities
Distance in words between the arguments

Syntactic features:
Presence of particular constructions in a constituent structure
Chunk base-phrase paths
Bags of chunk heads
Dependency-tree paths
Constituent-tree paths
Tree distance between the arguments
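A small sketch of assembling a few of the feature types above for one candidate relation instance; the sentence, the entity positions and the chosen features are made up for illustration.

tokens = ["When", "taking", "Zocor", "I", "had", "headaches"]
arg1 = {"index": 2, "type": "Treatment"}   # Zocor
arg2 = {"index": 5, "type": "Event"}       # headaches

def relation_features(tokens, arg1, arg2):
    between = tokens[arg1["index"] + 1 : arg2["index"]]
    return {
        "entity_types": (arg1["type"], arg2["type"]),            # entity-based
        "type_concat": arg1["type"] + "-" + arg2["type"],        # entity-based
        "bow_between": sorted(set(w.lower() for w in between)),  # word-based
        "word_distance": arg2["index"] - arg1["index"] - 1,      # word-based
    }

print(relation_features(tokens, arg1, arg2))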

Supervised Learning Approach to Entity Relation Extraction
Kernel-based methods are an effective alternative to explicit feature extraction.
They retain the original representation of objects and use the objects only through a kernel function computed between pairs of objects.
A kernel K(x, y) defines the similarity between objects x and y implicitly in a higher dimensional space.
Commonly used kernel functions for relation extraction:

Author | Kernel | Description | Node attributes
Zelenko et al. 2003 | Shallow parse tree kernel | Uses shallow parse trees | Entity type, word, POS tag
Culotta et al. 2004 | Dependency tree kernel | Uses dependency parse trees | Word, POS, generalized POS, chunk tag, entity type, entity level
Bunescu et al. 2005 | Shortest dependency path kernel | Shortest path between entities in a dependency tree | Word, POS, generalized POS, entity type

Ontology
An ontology represents knowledge as a set of concepts within a domain, using a shared vocabulary to denote the types, properties, and interrelationships of those concepts.
In text mining, ontologies are often used to extract named entities, detect entity relations and conduct sentiment analysis.
Commonly used ontologies are listed in the table below:

Name | Creator | Description | Application
WordNet | Princeton University | A large lexical database of English | Word sense disambiguation; text summarization; text similarity analysis
SentiWordNet | Andrea Esuli, Fabrizio Sebastiani | A lexical resource for opinion mining | Sentiment analysis
Linguistic Inquiry and Word Count (LIWC) | James W. Pennebaker, Roger J. Booth, Martha E. Francis | A lexical resource for sentiment analysis | Sentiment analysis; affect analysis; deception detection
Unified Medical Language System (UMLS) | US National Library of Medicine | A compendium of many controlled vocabularies in the biomedical sciences | Medical entity recognition
MedEffect | Canadian Adverse Drug Reaction Monitoring Program (CADRMP) | A knowledge base about drugs and side effects in Canada | Medical entity recognition; drug safety surveillance
Consumer Health Vocabulary (CHV) | University of Utah | Maps consumer health vocabulary to standard medical terms in UMLS | Medical entity recognition; health social media analytics
FDA's Adverse Event Reporting System (FAERS) | United States Food and Drug Administration | Documents adverse drug event reports and drug indications of all the medical products in the US market | Medical entity recognition; drug safety surveillance

WordNet
WordNet is an online lexical database in which English nouns, verbs, adjectives and adverbs are organized into sets of synonyms.
Each word represents a lexicalized concept. Semantic relations link the synonym sets (synsets).
WordNet contains more than 118,000 different word forms and more than 90,000 senses.
Approximately 17% of the words in WordNet are polysemous (have more than one sense); 40% ...

WordNet
Six semantic relations are presented in WordNet because they apply broadly throughout English and because a user need not have advanced training in linguistics to understand them. The table below shows the included semantic relations.

Semantic Relation | Syntactic Category | Examples
Synonymy (similar) | Noun, Verb, Adjective, Adverb | pipe, tube; rise, ascent; sad, unhappy; rapidly, speedily
Antonymy (opposite) | Adjective, Adverb | wet, dry; powerful, powerless; rapidly, slowly
Hyponymy (subordinate) | Noun | maple, tree; tree, plant
Meronymy (part) | Noun | brim, hat; ship, fleet
Troponymy (manner) | Verb | march, walk; whisper, speak
Entailment | Verb | drive, ride; divorce, marry

WordNet has been used for a number of different purposes in information systems, including word sense disambiguation, information retrieval, text classification, text summarization, machine translation and semantic textual similarity analysis.
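The following sketch shows how these synsets and relations can be queried through NLTK's WordNet interface; the particular words are arbitrary examples.

from nltk.corpus import wordnet as wn

# Synsets (synonym sets) for a word
for synset in wn.synsets('pipe'):
    print(synset.name(), synset.definition())

# Hyponymy/hypernymy: 'maple' (the tree sense) is a kind of 'tree'
maple = wn.synset('maple.n.02')
print(maple.hypernyms())

# Antonymy is defined between lemmas
wet = wn.synset('wet.a.01').lemmas()[0]
print(wet.antonyms())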

SentiWordNet
SentiWordNet is a lexical resource explicitly devised for supporting sentiment analysis and opinion mining applications.
SentiWordNet is the result of the automatic annotation of all the synsets of WordNet according to the notions of positivity, negativity and objectivity.
Each of the positivity, negativity and objectivity scores ranges in the interval [0.0, 1.0], and their sum is 1.0 for each synset.
The figure above shows the graphical representation adopted by SentiWordNet for representing the opinion-related properties of a term sense.

SentiWordNet
In SentiWordNet, different senses of the same term may have different opinion-related properties.
[Figure: visualization of the opinion-related properties (positivity, objectivity and negativity scores) of the three senses of the search term "estimable" in SentiWordNet (http://sentiwordnet.isti.cnr.it/search.php?q=estimable), including a synonym of "estimable" in one of the senses.]

Linguistic Inquiry and Word Count (LIWC)
Linguistic Inquiry and Word Count (LIWC) is a text analysis program that looks for and counts words in psychology-relevant categories across text files.
Empirical results using LIWC demonstrate its ability to detect meaning in a wide variety of experimental settings, including attentional focus, emotionality, social relationships, thinking styles, and individual differences.
LIWC is often adopted in NLP applications for sentiment analysis, affect analysis and deception detection.

Linguistic Inquiry and Word Count (LIWC)
The LIWC program has two major components: the processing component and the dictionaries.
Processing
Opens a series of text files (posts, blogs, essays, novels, and so on)
Each word in a given text is compared with the dictionary file.
Dictionaries: the collection of words that define a particular category
English dictionary: over 100,000 words across over 80 categories examined by human experts.
Major categories: functional words, social processes, affective processes, positive emotion, negative emotion, cognitive processes, biological processes, relativity, etc.
Multilingual: Arabic, Chinese, Dutch, French, German, Italian, Portuguese, Russian, Serbian, Spanish and Turkish.

Linguistic Inquiry and Word Count (LIWC)
[Figure: LIWC online demo output. The input text is a post from a 40-year-old female member of the American Diabetes Association online community; the output lists LIWC categories and the results for the input text, alongside results from personal text and formal writing for comparison.]

Unified Medical Language System (UMLS)
The Unified Medical Language System (UMLS) is a repository of biomedical vocabularies developed by the US National Library of Medicine.
UMLS integrates over 2.5 million names for 900,551 concepts from more than 60 families of biomedical vocabularies, as well as 12 million relations among these concepts.
Ontologies integrated in the UMLS Metathesaurus include the NCBI taxonomy, the Gene Ontology (GO), the Medical Subject Headings (MeSH), Online Mendelian Inheritance in Man (OMIM), the University of Washington Digital Anatomist symbolic knowledge base (UWDA) and the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT).

Unified Medical Language System (UMLS)
Major ontologies integrated in UMLS:

Name | Creator | Description | Application
National Center for Biotechnology Information (NCBI) Taxonomy | National Library of Medicine | All of the organisms in public sequence databases | Identify organisms
University of Washington Digital Anatomist Source Information (UWDA) | University of Washington Structural Informatics Group | Symbolic models of the structures and relationships that constitute the human body | Identify terms in anatomy
Gene Ontology (GO) | Gene Ontology Consortium | Gene product characteristics and gene product annotation data | Gene product annotation
Medical Subject Headings (MeSH) | National Library of Medicine | Vocabulary thesaurus used for indexing articles for PubMed | Cover terms in the biomedical literature
Online Mendelian Inheritance in Man (OMIM) | McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University | Human genes and genetic phenotypes | Annotate human genes
Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT) | College of American Pathologists | Comprehensive, multilingual clinical healthcare terminology | Identify clinical terms

Unified Medical Language System (UMLS)
Accessing UMLS data
No fee, but a license agreement is required
Available for research purposes; restrictions apply to other kinds of applications
UMLS related tools
MetamorphoSys (command line program)
UMLS installation wizard and customization tool
Selecting concepts from a given sub-domain
Selecting the preferred name of concepts
MetaMap (Java)
Extracts UMLS concepts from text
Accepts variable-length input text
Outputs a ranked list of UMLS concepts associated with the input text

MedEffect
MedEffect is the Canada Vigilance Adverse Reaction Online Database, which contains information about suspected adverse reactions to health products.
Reports are submitted by consumers and health professionals
Contains a complete list of medications, adverse reactions and drug indications (medical conditions for legitimate use of a medication)
MedEffect is often used in healthcare research for annotating medications and adverse reactions in text (Leaman et al. 2010; Chee et al. 2011).

Consumer Health Vocabulary (CHV)
Consumer Health Vocabulary (CHV) is a lexicon linking UMLS standard medical terms to health consumer vocabulary.
Laypeople use a different vocabulary from healthcare professionals to describe medical problems.
CHV helps to bridge the communication gap between consumers and healthcare professionals by mapping UMLS standard medical terms to consumer health language.
It has been applied in prior studies to better understand and match user expressions for medical entity extraction in social media (Yang et al. 2012; Benton et al. 2011).

FDA's Adverse Event Reporting System (FAERS)
FDA's Adverse Event Reporting System (FAERS) documents adverse drug event reports and drug indications of all the medical products in the US market.
Reports are submitted by consumers, health professionals, pharmaceutical companies and researchers.
Contains a complete list of medical products in the United States and their suspected adverse reactions
FAERS has been applied in healthcare research for medical named entity recognition and adverse drug event extraction (Bian et al. 2012; Liu et al. 2013).

A-Z LIST OF OPEN SOURCE NLP TOOLKITS

Name | Main Features | Language | Creators | Website
Antelope framework | Part-of-speech tagging, dependency parsing, WordNet lexicon | C#, VB.net | Proxem | [1]
Apertium | Machine translation for language pairs from Spanish, English, French, Portuguese, Catalan and Occitan | C++, Java | (various) | [2]
ClearTK | Wrappers for machine learning libraries (SVMlight, LibSVM, OpenNLP MaxEnt) and NLP tools (Snowball Stemmer, OpenNLP, Stanford CoreNLP) | Java | The Center for Computational Language and Education Research at the University of Colorado Boulder | [3]
cTakes | Sentence boundary detection, tokenization, normalization, POS tagging, chunking, context (family history, symptoms, disease, disorders, procedures) annotator, negation detection, dependency parsing, drug mention annotator | Java | Children's Hospital Boston, Mayo Clinic | [4]
DELPH-IN | Deep linguistic analysis: head-driven phrase structure grammar (HPSG) and minimal recursion semantics parsing | LISP, C++ | Deep Linguistic Processing with HPSG Initiative | [5]
Factorie | Scalable NLP toolkit for named entity recognition, relation extraction, parsing, pattern matching, and topic modeling (LDA) | Java | University of Massachusetts Amherst | [6]
FreeLing | Tokenization, sentence splitting, contraction splitting, morphological analysis, named entity recognition, POS tagging, dependency parsing, co-reference resolution | C++ | Universitat Politècnica de Catalunya | [7]
General Architecture for Text Engineering (GATE) | Information extraction (tokenization, sentence splitter, POS tagger, named entity recognition, coreference resolution), machine learning library wrappers (Weka, MaxEnt, SVMLight, RASP, LibSVM), ontology (WordNet) | Java | GATE open source community | [8]
Graph Expression | Information extraction (named entity recognition, relation and fact extraction, parsing and search problem solving) | Java | Startup huti.ru | [9]

Name | Main Features | Language | Creators | Website
Learning Based Java | POS tagging, chunking, coreference resolution, named entity recognition | Java | Cognitive Computation Group at UIUC | [10]
LingPipe | Topic classification, named entity recognition, clustering, POS tagging, spelling correction, sentiment analysis, logistic regression, word sense disambiguation | Java | Alias-i | [11]
Mahout | Scalable machine learning libraries (logistic regression, naive Bayes, random forest, HMM, SVM, neural network, boosting, K-means, fuzzy K-means, LDA, expectation maximization, PCA) | Java | Online community | [12]
Mallet | Document classification (naive Bayes, maximum entropy, decision trees), sequence tagging (HMM, MEMM, CRF), topic modeling (LDA, hierarchical LDA) | Java | University of Massachusetts Amherst | [13]
MetaMap | Maps biomedical text to the UMLS Metathesaurus and discovers Metathesaurus concepts referred to in text | Java | National Library of Medicine | [14]
MII nlp toolkit | De-identification tools for free-text medical reports | Java | UCLA Medical Imaging Informatics (MII) Group | [15]
MontyLingua | Tokenization, POS tagging, chunking, extractors for phrases and subject/verb/object tuples from sentences, morphological analysis, text summarization | Python, Java | MIT | [16]
Natural Language Toolkit (NLTK) | Interface to over 50 open access corpora and lexicon resources such as WordNet; text processing libraries for classification, tokenization, stemming, POS tagging, parsing and semantic reasoning | Python | Online community | [17]
NooJ (based on INTEX) | Morphological analysis, syntactic parsing, named entity recognition | .NET Framework-based | University of Franche-Comté, France | [18]

Name | Main Features | Language | Creators | Website
OpenNLP | Tokenization, sentence segmentation, POS tagging, named entity extraction, chunking, parsing, coreference resolution | Java | Online community | [19]
Pattern | Wrappers for the Google, Twitter and Wikipedia APIs, web crawler, HTML DOM parsing, POS tagging, n-gram search, sentiment analysis, WordNet, machine learning algorithms for clustering and classification, network analysis and visualization | Python | Tom De Smedt, CLiPS, University of Antwerp | [20]
PSI-Toolkit | Text preprocessing, sentence splitting, tokenization, lexical and morphological analysis, syntactic/semantic parsing, machine translation | C++ | Adam Mickiewicz University in Poznań | [21]
ScalaNLP | Tokenization, POS tagging, sentence segmentation, sequence tagging (CRF, HMM), machine learning algorithms (linear regression, naive Bayes, SVM, K-means, LDA, neural network) | Scala | David Hall and Daniel Ramage | [22]
Stanford NLP | Tokenization, POS tagging, named entity recognition, parsing, coreference, topic modeling, classification (naive Bayes, logistic regression, maximum entropy), sequence tagging (CRF) | Java | The Stanford Natural Language Processing Group | [23]
Rasp | Tokenization, POS tagging, lemmatization, parsing | C++ | University of Cambridge, University of Sussex | [24]
Natural | Tokenization, stemming, classification (naive Bayes, logistic regression), morphological analysis, WordNet | JavaScript, NodeJs | Chris Umbel | [25]
Text Engineering Software Laboratory (Tesla) | Tokenization, POS tagging, sequence alignment | Java | University of Cologne | [26]
Treex | Machine translation | Perl | Charles University in Prague | [27]

Name | Main Features | Language | Creators | Website
UIMA | Industry standard for content analytics; contains a set of rule-based and machine learning annotators and tools | Java/C++ | Apache | [28]
VisualText | Tokenization, POS tagging, named entity recognition, classification, text summarization | NLP++ (compiles to C++) | Text Analysis International, Inc. | [29]
WebLab-project | Language identification, named entity recognition, semantic analysis, relation extraction, text classification and clustering, text summarization | Java/C++ | OW2 | [30]
UniteX | Tokenization, sentence boundary detection, parsing, morphological analysis, rule-based named entity recognition, text alignment, word sense disambiguation | Java & C++ | Laboratoire d'Automatique Documentaire et Linguistique | [31]
The Dragon Toolkit | Tools for accessing PubMed, the TREC collection, NewsGroup articles, Reuters articles, and the Google search engine; ontologies (UMLS, WordNet, MeSH); tokenization, stemming, POS tagging, named entity recognition, classification (naive Bayes, SVM-light, LibSVM, logistic regression), clustering (K-means, hierarchical clustering), topic modeling (LDA), text summarization | Java | Drexel University | [32]
Text Extraction, Annotation and Retrieval Toolkit | Tokenization, chunking, sentence segmenting, parsing, ontology (WordNet), topic modeling (LDA), named entity recognition, stemming, machine learning algorithms (decision tree, SVM, neural network) | Ruby | Louis Mullie | [33]
Zhihuita NLP API | Chinese text segmentation, spelling checking, pattern matching | (not listed) | Zhihuita.org | [34]

SHARED TASKS (COMPETITIONS) IN HEALTHCARE AND NATURAL LANGUAGE PROCESSING DOMAINS

Introduction
Shared task series in Natural Language Processing often represent community-wide trends and hot topics that have not been fully explored in the past.
To keep up with state-of-the-art techniques and new research topics in the NLP community, we explore major conferences, workshops, and special interest groups belonging to the Association for Computational Linguistics (ACL).
We organize our findings into two categories: ongoing shared tasks and a watch list.
The ongoing list contains competitions that have already made task descriptions, data and schedules for 2014 publicly available:
International Workshop on Semantic Evaluation (SemEval)
CLEF eHealth Evaluation Lab
The watch list contains competitions that haven't made content available yet but are relevant to our interests:
Conference on Natural Language Learning (CoNLL) Shared Tasks
Joint Conference on Lexical and Computational Semantics (*SEM) Shared Tasks
BioNLP
i2b2 Challenge

SemEval
Overview
SemEval, the International Workshop on Semantic Evaluation, is an ongoing series of evaluations of computational semantic analysis systems. It evolved from the SensEval (word sense evaluation) series.
SIGLEX, the Special Interest Group on the Lexicon of the Association for Computational Linguistics, is the umbrella organization for SemEval.
SemEval-2014 will be the 8th workshop on semantic evaluation. The workshop will be co-located with the 25th International Conference on Computational Linguistics (COLING) in Dublin, Ireland.

SemEval
Past workshops

Workshop | No. of Tasks | Areas of study | Languages of Data Evaluated
Senseval-1 (1998) | (not given) | Word Sense Disambiguation (WSD): Lexical Sample WSD tasks | English, French, Italian
Senseval-2 (2001) | 12 | WSD: Lexical Sample, All Words, Translation WSD tasks | Basque, Chinese, Czech, Danish, Dutch, English, Estonian, Italian, Japanese, Korean, Spanish, Swedish
Senseval-3 (2004) | 16 | Logic Form Transformation, Machine Translation (MT) Evaluation, Semantic Role Labeling, WSD | Basque, Catalan, Chinese, English, Italian, Romanian, Spanish
SemEval-2007 | 19 | Cross-lingual, Frame Extraction, Information Extraction, Lexical Substitution, Lexical Sample, Metonymy, Semantic Annotation, Semantic Relations, Semantic Role Labeling, Sentiment Analysis, Time Expression, WSD | Arabic, Catalan, Chinese, English, Spanish, Turkish
SemEval-2010 | 18 | Co-reference, Cross-lingual, Ellipsis, Information Extraction, Lexical Substitution, Metonymy, Noun Compounds, Parsing, Semantic Relations, Semantic Role Labeling, Sentiment Analysis, Textual Entailment, Time Expressions, WSD | Catalan, Chinese, Dutch, English, French, German, Italian, Japanese, Spanish
SemEval-2012 | (not given) | Common Sense Reasoning, Lexical Simplification, Relational Similarity, Spatial Role Labeling, Semantic Dependency Parsing, Semantic and Textual Similarity | Chinese, English
SemEval-2013 | 14 | Temporal Annotation, Sentiment Analysis, Spatial Role Labeling, Noun Compounds, Phrasal Semantics, Textual Similarity, Response Analysis, Cross-lingual Textual Entailment, BioMedical Texts, Cross and Multi-lingual WSD, Word Sense Induction, Lexical Sample | Catalan, French, German, English, Italian, Spanish

SemEval-2014

Task Name | Description | Data
Evaluation of compositional distributional semantic models (CDSMs) on full sentences | Subtask A: predicting the degree of relatedness between two sentences. Subtask B: detecting the entailment relation holding between them. | 10,000 English sentence pairs, each annotated for relatedness in meaning and for the entailment relation (entailment, contradiction, neutral) between the two sentences.
Grammar Induction for Spoken Dialogue Systems | Creating clusters consisting of semantically similar fragments. For example, the fragments "depart from <City>" and "fly out of <City>" are in the same cluster, as they refer to the concept of departure city. | Training data will cover two domains: air travel and tourism. The data will be available in two languages: Greek and English.
Cross-level semantic similarity | Evaluating similarity across different sizes of text: paragraph to sentence, sentence to phrase, phrase to word, and word to sense. | Information about the data hasn't been released yet.
Aspect Based Sentiment Analysis | Subtask 1: aspect term extraction. Subtask 2: aspect term polarity. Subtask 3: aspect category detection. Subtask 4: aspect category polarity. | Two domain-specific datasets (restaurant reviews and laptop reviews), consisting of over 6,500 sentences with fine-grained aspect-level human-authored annotations, will be provided.
L2 writing assistant | Build a translation assistance system that concerns the translation of fragments (words or phrases) of one language (L1) in a second-language (L2) context. For example, input (L1=French, L2=English): "I rentre la ..." | The data set covers the following L1 and L2 pairs: English-German, English-Spanish, French-English and Dutch-English. The trial data contains 500 sentences for each language pair. Information about ...
SemEval-2014

Task Name | Description | Data
Spatial Robot Commands | Parse spatial robot commands using data from an annotated corpus, collected from a simplified blocks world game (http://www.trainrobot.com). | In the trial data, each natural language command is annotated with a robot command. "Move the blue block on top of the grey block." is labeled as (event: (action: move) (entity: (color: blue) (type: cube)) (destination: (spatial-relation: (relation: above) (entity: (color: gray) (type: cube))))).
Analysis of Clinical Text | Combine supervised methods for entity/acronym/abbreviation recognition and mapping to UMLS CUIs (Concept Unique Identifiers) with unsupervised discovery and sense induction of the entities/acronyms/abbreviations. | Information about the data hasn't been released yet.
Broad-Coverage and Cross-Framework Semantic Dependency Parsing | This task seeks to stimulate more generalized semantic dependency parsing and give a more direct analysis of who did what to whom in sentences. | In the trial data, 198 sentences from the WSJ are annotated with the desired semantic representation.
Sentiment Analysis in Twitter | Subtask A (Contextual Polarity Disambiguation): given a message containing a marked instance of a word or a phrase, determine whether that instance is positive, negative or neutral in that context. Subtask B (Message Polarity Classification): given a message, decide whether the message is of positive, negative, or neutral sentiment. | Training: 9,728 Twitter messages. Development: 1,654 Twitter messages (can also be used for training). Development-test A: 3,814 Twitter messages (cannot be used for training). Development-test B: 2,094 SMS messages (cannot be used for training). The annotations and systems will use a ...

SemEval-2014
Important Dates
Trial data ready: Oct. 30, 2013
Training data ready: Dec. 15, 2013
Test data ready: Mar. 10, 2014
Evaluation end: Mar. 30, 2014
Paper submission due: Apr. 30, 2014
Paper reviews due: May 30, 2014
Camera ready due: Jun. 30, 2014
Workshop: Aug. 23-30, 2014, Dublin, Ireland

CLEF eHealth Evaluation Lab
Overview
The CLEF Initiative (Conference and Labs of the Evaluation Forum) is a self-organized body whose main mission is to promote research, innovation, and development of information access systems, with an emphasis on multilingual and multimodal information with various levels of structure.
Started in 2000, CLEF aims to stimulate investigation and research in a wide range of key areas in the information retrieval domain, and has become well known in the international IR community. The results were traditionally presented and discussed at annual workshops in conjunction with the European Conference for Digital Libraries (ECDL), now called Theory and Practice of Digital Libraries (TPDL).

CLEF eHealth Evaluation Lab
Overview
In 2013, CLEF started the eHealth Evaluation Lab, a shared task focused on natural language processing (NLP) and information retrieval (IR) for clinical care.
The CLEF eHealth Evaluation Lab 2013 had three tasks:
Annotation of disorder mention spans in clinical reports
Annotation of acronym/abbreviation mention spans in clinical reports
Information retrieval on medically related web documents

CLEF eHealth 2014

Task | Description | Data
Visual-Interactive Search and Exploration of eHealth Data | Subtask A: visualize a discharge summary together with the disorder standardization and shorthand expansion data in an effective and understandable way for laypeople. Subtask B: design a visual exploration approach that provides an effective overview over a larger set of possibly relevant documents to meet the patient's information need. | 6 de-identified discharge summaries and 50 real patient search queries generated from the discharge summaries.
Information extraction from clinical text | Develop annotated data, resources and methods that make clinical documents easier to understand from nurses' and patients' perspectives. 10 different attributes (Negation Indicator, Subject Class, Uncertainty Indicator, Course Class, Severity Class, Conditional Class, Generic Class, Body Location, DocTime Class, and Temporal Expression) should be captured from clinical text and classified into the appropriate value slots. | A set of de-identified clinical reports provided by the MIMIC II database. A training set of 300 reports and their disease/disorder mention templates with filled attribute:value slots will be provided. A test set of 200 reports and their disease/disorder mention templates with default-filled attribute:value slots will be provided one week before the run submission deadline.
User-centered health information retrieval | Subtask A: monolingual information retrieval task; retrieve the relevant medical documents for the user queries. Subtask B: multilingual information retrieval task; German, French and Czech. | A set of medical-related documents in four languages (English, German, French and Czech) provided by the Khresmoi project (approximately 1 million medical documents for each language). 5 training queries and 50 test queries are provided.

CLEF eHealth 2014
Important Dates
CLEF 2014 Lab registration opens: Nov. 2013
Task data release begins: Nov. 15, 2013
Participant submission deadline (final submission to be evaluated): May 01, 2014
Results released: Jun. 01, 2014
Participant working notes (i.e., extended abstracts and reports) submission deadline: Jun. 15, 2014
CLEF eHealth lab session at CLEF 2014 in Sheffield, UK: Sept. 15-18, 2014

CoNLL
Overview
CoNLL, the Conference on Natural Language Learning, is a yearly meeting of the Special Interest Group on Natural Language Learning (SIGNLL) of the Association for Computational Linguistics (started in 1997).
Since 1999, CoNLL has included a shared task in which training and test data are provided by the organizers, which allows participating systems to be evaluated and compared in a systematic way. Descriptions of the systems and evaluations of their performance are presented both at the conference and in the proceedings.
The last CoNLL was held in August 2013, in Sofia, Bulgaria. Information about CoNLL 2014 and its shared task will be released next month.

CoNLL
Recent shared tasks from CoNLL

Year | Task | Data | Language
2013 | Grammatical Error Correction | National University of Singapore Corpus of Learner English (NUCLE) | English
2012 | Modeling Multilingual Unrestricted Coreference in OntoNotes | OntoNotes dataset from the Linguistic Data Consortium | Arabic, Chinese, English
2011 | Modeling Unrestricted Coreference in OntoNotes | OntoNotes dataset from the Linguistic Data Consortium | English
2010 | Subtask A: Learning to detect sentences containing uncertainty. Subtask B: Learning to resolve the in-sentence scope of hedge cues | A: biological abstracts and full articles from the BioScope (biomedical domain) corpus; B: paragraphs from Wikipedia possibly containing weasel information | English
2009 | Syntactic and Semantic Dependencies in Multiple Languages | Data with gold standard annotation of syntactic dependency, type of dependency, frame, role set and sense in multiple languages | English, Catalan, Chinese, Czech, German, Japanese and Spanish

*SEM
Overview
The Joint Conference on Lexical and Computational Semantics (*SEM), started in 2012, is organized by the Association for Computational Linguistics (ACL) Special Interest Group on the Lexicon (SIGLEX) and Special Interest Group on Computational Semantics (SIGSEM).
The main goal of *SEM is to provide a stable forum for researchers working on different aspects of semantic processing.
Every *SEM conference includes a shared task in which training and test data are provided by the organizers, allowing participating systems to be evaluated and compared in a systematic way. *SEM 2014 will release information about its shared task in Dec. 2013 or early Jan. 2014.

*SEM
*SEM 2012 shared task:
Description: Resolving the scope and the focus of negation.
Data: Stories by Conan Doyle and WSJ PropBank data (about 8,000 sentences in total). All occurrences of negation, their scope and focus are annotated.
*SEM 2013 shared task:
Description: Create a unified framework for the evaluation of semantic textual similarity modules and characterize their impact on NLP applications.
Data: The data covers 5 areas: paraphrase sentence pairs (MSRpar), sentence pairs from video descriptions (MSRvid), MT evaluation sentence pairs (MTnews and MTeuroparl) and gloss pairs (OnWN).

BioNLP
Overview
BioNLP shared tasks are organized by the ACL's Special Interest Group for biomedical natural language processing.
BioNLP 2013 was the twelfth workshop on biomedical natural language processing and was held in conjunction with the annual ACL or NAACL meeting.
BioNLP shared tasks are a biennial event held with the BioNLP workshop, starting from 2009. The next event will be held in 2015.

BioNLP Past Shared Tasks

Year | Tasks | Data | Release Date | End Date
2013 | Genia Event Extraction for NFkB knowledge base construction; Cancer Genetics; Pathway Curation; Corpus Annotation with Gene Regulation Ontology; Bacteria Biotopes; Gene Regulation Network in Bacteria | PubMed abstracts and full-text literature, the NFkB knowledge base, and webpage documents with general information about bacteria species | Oct. 2012 | Apr. 2013
2011 | 1. GENIA; 2. Epigenetics and Post-translational Modifications; 3. Infectious Diseases; 4. Bacteria Biotopes; 5. Bacteria Interactions; 6. Co-reference; 7. Gene/Protein Entity Relations; 8. Gene Renaming | PubMed abstracts | Dec. 2010 | Apr. 2011
2009 | 1. Core event extraction (identify events concerning the given proteins); 2. Event enrichment | PubMed abstracts | Dec. 15, 2008 | Mar. 30, 2009

i2b2 Challenges
Informatics for Integrating Biology and the Bedside (i2b2) is an NIH-funded National Center for Biomedical Computing (NCBC).
The i2b2 center organizes data challenges to motivate the development of scalable computational frameworks to address the bottleneck limiting the translation of genomic findings and hypotheses in model systems relevant to human health.
i2b2 challenge workshops are held in conjunction with the Annual Meeting of the American Medical Informatics Association.

Previous i2b2 Challenges

Year | Task | Data | Release Date | End Date
2012 | Temporal relation extraction | EHR | Jun. 2012 | Sept. 2012
2011 | Co-reference resolution | EHR | Jun. 2011 | Sept. 2011
2010 | Relation extraction on medical problems | Discharge summaries | Apr. 2010 | Sept. 2010
2009 | Medication extraction | Narrative patient records | Jun. 2009 | Sept. 2009
2008 | Recognizing obesity and comorbidities | Discharge summaries | Mar. 2008 | Sept. 2008
2006 | De-identification of discharge summaries | Discharge summaries | Jun. 2006 | Sept. 2006

APPLYING TEXT MINING IN HEALTH SOCIAL MEDIA RESEARCH: AN EXAMPLE

Extracting Adverse Drug Events from Health Social Forums
Online patient forums can provide valuable supplementary information on drug effectiveness and side effects.
These forums cover a large and diverse population and contain data directly from patients.
Patient forum ADE reports can serve as an economical alternative to expensive and time-consuming patient-oriented drug safety data collection projects.
They can help to generate new clinical hypotheses, cross-validate the adverse drug events detected from other data sources, and conduct comparison studies.

Post ID | Post Content | Contains ADE? | Report source
9043 | I had horrible chest pain [Event] under Actos [Treatment]. | ADE | Patient
12200 | From what you have said, it seems that Lantus [Treatment] has had some negative side effects related to depression [Event] and mood swings [Event]. | ADE | Hearsay
25139 | I never experienced fatigue [Event] when using Zocor [Treatment]. | Negated ADE | Patient
34188 | When taking Zocor [Treatment], I had headaches [Event] and bruising [Event]. | ADE | Patient
63828 | Another study of people with multiple risk factors for stroke [Event] found that Lipitor [Treatment] reduced the risk of stroke [Event] by 26% compared to those taking a placebo, the company said. | Drug Indication | Diabetes research

Test Bed
The three forums contain discussion about disease monitoring and medical products, and about diseases and medical problems.

Forum Name | Number of Posts | Number of Topics | Number of Member Profiles | Time Span | Total Number of Sentences
American Diabetes Association | 184,874 | 26,084 | 6,544 | 2009.2-2012.11 | 1,348,364
Diabetes Forums | 568,684 | 45,830 | 12,075 | 2002.2-2012.11 | 3,303,804
Diabetes Forum | 67,444 | 6,474 | 3,007 | 2007.2-2012.11 | 422,355

Extracting Adverse Drug Events from Health Social Forums
Challenges
Topics in patient social media come from various sources, including news and research, hearsay (stories about other people) and patients' own experience. Redundant and noisy information often masks patient-experienced ADEs.
Currently, extracting adverse event and drug relations from patient comments results in low precision, due to confounding with drug indications (legitimate medical conditions a drug is used for) and negated ADEs (contradiction or denial of experiencing ADEs) in sentences.
Solutions
Develop a relation extractor for recognizing and extracting adverse drug event relations.
Develop a text classifier to extract adverse drug event reports based on patient experience.

Extracting Adverse Drug Events from Health Social Forums
Patient forum data collection: collect patient forum data through a web crawler.
Data preprocessing: remove noisy text such as URLs and duplicated punctuation, and separate posts into individual sentences (see the sketch after this list).
Medical entity extraction: identify treatments and adverse events discussed in the forum.
Adverse drug event extraction: identify drug-event pairs indicating an adverse drug event based on the results of medical entity extraction.
Report source classification: classify the source of reported events as either patient experience or hearsay.
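A minimal preprocessing sketch for the data preprocessing step, assuming NLTK's sentence tokenizer and simple regular expressions; the exact cleaning rules used in the original study are not specified on these slides.

import re
import nltk

# nltk.download('punkt')  # one-time setup for the sentence tokenizer

def preprocess(post):
    post = re.sub(r'https?://\S+', ' ', post)      # remove URLs
    post = re.sub(r'([!?.,])\1+', r'\1', post)     # collapse duplicated punctuation
    post = re.sub(r'\s+', ' ', post).strip()       # normalize whitespace
    return nltk.sent_tokenize(post)                # split the post into sentences

print(preprocess("I had headaches!!! See http://example.com ... Switched drugs last week."))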

Medical Entity Extraction
Initialize the medical entity extraction with MetaMap to match terms related to drugs and ADEs in the forum discussions. MetaMap is a Java API that extracts UMLS medical terms; the figure below shows sample MetaMap output.
Filter out the terms extracted by MetaMap that never appear in FAERS reports. FAERS is the FDA's knowledge base, which contains adverse drug event reports filed by consumers, doctors and drug companies.
Query the Consumer Health Vocabulary for consumer-preferred terms for the entities extracted by MetaMap, and look up those consumer terms in the discussions. The Consumer Health Vocabulary is a lexicon for mapping consumer-preferred terms to terms in a standard biomedical ontology such as UMLS.

Adverse Drug Event Extraction
Kernel-based statistical learning
Feature generation: generate representations of the relation instances
Syntactic and semantic class mapping: categorize lexical features into syntactic and semantic classes to reduce feature sparsity
Shortest dependency path kernel: compute the similarity score between two relation instances
Rule-based classification (semantic filtering)
Drug indications from FAERS: incorporate medical domain knowledge to differentiate drug indications from adverse events
NegEx: incorporate linguistic knowledge to identify negated adverse drug events
Semantic templates: form filtering templates using the knowledge from FAERS and NegEx

Adverse Drug Event Extraction
Feature generation
We utilized the Stanford Parser (http://nlp.stanford.edu/software/stanford-dependencies.shtml) for dependency parsing.
The figure above shows the dependency tree of a sentence. In this sentence, hypoglycemia is an adverse event and Lantus is a diabetes treatment. Grammatical relations between words are illustrated in the figure. For instance, "cause" and "hypoglycemia" have the relation dobj, as "hypoglycemia" is the direct object of "cause". In this relation, "cause" is the governor and "hypoglycemia" is the dependent.

Adverse Drug Event Extraction
Syntactic and semantic classes mapping
To reduce data sparsity and increase the robustness of our method, we expand the shortest dependency path by categorizing the words on the path into syntactic and semantic classes with varying degrees of generality.
Word classes include part-of-speech (POS) tags and generalized POS tags. POS tags are extracted with the Stanford CoreNLP package and generalized following the Penn Treebank POS tag guidelines. Semantic types (Event and Treatment) are also used for the two ends of the shortest path.
Syntactic and semantic classes mapping from the dependency graph:
The relation instance in the figure above is represented as a sequence of features X = [x1, x2, x3, x4, x5, x6, x7], where x1 = {Hypoglycemia, NN, Noun, Event}, x2 = {->}, x3 = {cause, VB, Verb}, x4 = {<-}, x5 = {action, NN, Noun}, x6 = {<-}, x7 = {Lantus, NN, Noun, Treatment}.

Adverse Drug Event Extraction
Shortest dependency path kernel function
If x = x1 x2 ... xm and y = y1 y2 ... yn are two relation examples, where xi denotes the set of word classes corresponding to position i, the kernel function is computed as (Bunescu et al. 2005):

K(x, y) = 0 if m != n; otherwise K(x, y) = product over i = 1..n of c(xi, yi),

where c(xi, yi) = |xi ∩ yi| is the number of common word classes between xi and yi.

Relation instance x = [{Hypoglycemia, NN, Noun, Event}, {->}, {cause, VB, Verb}, {<-}, {action, NN, Noun}, {<-}, {Lantus, NN, Noun, Treatment}].
Relation instance y = [{depression, NN, Noun, Event}, {->}, {indicate, VBP, Verb}, {<-}, {effect, NN, Noun}, {<-}, {Lantus, NNP, Noun, Treatment}].
K(x, y) is the product, over positions i, of the number of common features between xi and yi:
K(x, y) = 3 * 1 * 1 * 1 * 2 * 1 * 3 = 18.
A sketch of this computation in code follows.
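The following Python sketch reproduces the kernel computation above; the word-class sets are written out by hand for illustration.

from functools import reduce

def sdp_kernel(x, y):
    # Shortest-path kernel (Bunescu et al. 2005): zero for paths of different
    # lengths, otherwise the product of common word classes at each position.
    if len(x) != len(y):
        return 0
    return reduce(lambda acc, pair: acc * len(pair[0] & pair[1]), zip(x, y), 1)

x = [{"Hypoglycemia", "NN", "Noun", "Event"}, {"->"}, {"cause", "VB", "Verb"},
     {"<-"}, {"action", "NN", "Noun"}, {"<-"}, {"Lantus", "NN", "Noun", "Treatment"}]
y = [{"depression", "NN", "Noun", "Event"}, {"->"}, {"indicate", "VBP", "Verb"},
     {"<-"}, {"effect", "NN", "Noun"}, {"<-"}, {"Lantus", "NNP", "Noun", "Treatment"}]

print(sdp_kernel(x, y))  # 3 * 1 * 1 * 1 * 2 * 1 * 3 = 18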

Adverse Drug Event Extraction
SVM classification
Many SVM software packages and tools have been developed and commercialized.
Among them, the SVM-light package and LIBSVM are two of the most widely used tools. Both are free of charge and can be downloaded from the Internet.
SVM-light is available at http://svmlight.joachims.org/
LIBSVM can be found at http://www.csie.ntu.edu.tw/~cjlin/libsvm/
[Figure: screenshot of SVM-light.]

Adverse Drug Event Extraction
ALGORITHM. STATISTICAL LEARNING FOR ADVERSE DRUG EVENT EXTRACTION
Input: all the relation instances with a pair of related drug and medical events, R(drug, event).
Output: whether each instance has a pair of related drug and event.
Procedure:
1. For each relation instance R(drug, event):
   Generate the dependency tree T of R(drug, event)
   Features = ShortestDependencyPathExtraction(T, R)
   Features = SyntacticAndSemanticClassesMapping(Features)
2. Separate the relation instances into a training set and a test set
3. Train an SVM classifier C with the shortest dependency path kernel function on the training set
4. Use the SVM classifier C to classify instances in the test set into two classes: R(drug, event) = True and R(drug, event) = False.

Adverse Drug Event Extraction
ALGORITHM. SEMANTIC FILTERING
Input: a relation instance i with a pair of related drug and medical events, R(drug, event).
Output: the relation type.
If drug exists in FAERS:
   Get the indication list for drug;
   For each indication in the indication list:
      If event = indication:
         Return R(drug, event) = Drug Indication;
For each rule in NegEx:
   If relation instance i matches the rule:
      Return R(drug, event) = Negated Adverse Drug Event;
Return R(drug, event) = Adverse Drug Event;
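A direct Python translation of the filtering logic above, with a toy indication dictionary and a toy negation-pattern list standing in for FAERS and NegEx; both are placeholders, not the real resources.

import re

FAERS_INDICATIONS = {"lipitor": {"stroke", "high cholesterol"}}   # toy stand-in for FAERS
NEGEX_PATTERNS = [r"\bnever\b", r"\bno longer\b", r"\bdenies\b"]  # toy stand-in for NegEx rules

def classify_relation(drug, event, sentence):
    if event.lower() in FAERS_INDICATIONS.get(drug.lower(), set()):
        return "Drug Indication"
    if any(re.search(pattern, sentence.lower()) for pattern in NEGEX_PATTERNS):
        return "Negated Adverse Drug Event"
    return "Adverse Drug Event"

print(classify_relation("Zocor", "fatigue", "I never experienced fatigue when using Zocor."))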

Report Source Classification
To classify the report source of adverse drug events, we developed a feature-based classification model to distinguish patient reports from hearsay, based on prior studies.
We adopted BOW features and the Transductive Support Vector Machine in SVM-light for classification.

Evaluation on Medical Entity Extraction
[Chart: precision, recall and F-measure of medical entity extraction on the three forums, roughly in the range of 80% to 94%.]
The performance of our system (F-measure) surpasses the best performance in prior studies (F-measure 73.9%), which was achieved by applying UMLS and MedEffect to extract adverse events from DailyStrength (Leaman et al., 2010). There may be several reasons why our approach outperforms prior work:
The combination of multiple lexicons improves precision.
DailyStrength is a general health social website where users may have a more diverse health vocabulary and exhibit more linguistic creativity; extracting medical named entities there could be more difficult than in our data source.

Evaluation on Adverse Drug Event Extraction
[Chart: precision, recall and F-measure of adverse drug event extraction on the three forums, for the co-occurrence (CO), statistical learning (SL) and statistical learning plus semantic filtering (SL+SF) approaches.]
Compared to the co-occurrence based approach (CO), statistical learning (SL) increased precision from around 40% to above 60%, while recall dropped from 100% to around 60%. The F-measure of SL is better than that of the CO method.
Semantic filtering (SF) further improved extraction precision from 60% to about 80% by filtering out drug indications and negated ADEs.

Evaluation on Report Source Classification
[Chart: precision, recall and F-measure of adverse drug event extraction on the three forums, with and without report source classification.]
Without report source classification (RSC), the performance of extraction is heavily affected by noise in the discussions:
Precision ranged from 51% to 62% without RSC.
Overall performance (F-measure) ranged from 68% to 76%.
After report source classification, precision and F-measure improved significantly:
Precision increased from 51% up to 84%.
Overall performance (F-measure) increased from 68% to above 80%.

Contrast of Our Proposed Framework with the Co-occurrence Based Approach
[Chart: American Diabetes Association: 2,972 relation instances, of which 1,069 (35.97%) are adverse drug events and 652 (21.94%) are patient-reported ADEs. Diabetes Forums: 3,652 relation instances, 1,387 (37.98%) adverse drug events, 721 (19.74%) patient-reported ADEs. Diabetes Forum: 1,072 relation instances, 421 (39.27%) adverse drug events, 194 (18.10%) patient-reported ADEs.]
A large number of false adverse drug events could not be filtered out by the co-occurrence based approach.
Based on our approach, only 35% to 40% of all the relation instances contain adverse drug events.
Among them, about 50% come from patient reports.

References
*SEM: http://ixa2.si.ehu.es/starsem/
CoNLL: http://ifarm.nl/signll/conll/
SemEval: http://alt.qcri.org/semeval2014/
CLEF eHealth: http://clefehealth2014.dcu.ie/home
BioNLP: http://2013.bionlp-st.org/
i2b2: https://www.i2b2.org/
Benton A., Ungar L., Hill S., Hennessy S., Mao J., Chung A., & Holmes J. H. (2011). Identifying potential adverse effects using the web: A new approach to medical hypothesis generation. Journal of Biomedical Informatics, 44(6), pp. 989-996.
Bian, J., Topaloglu, U., & Yu, F. (2012). Towards large-scale twitter mining for drug-related adverse events. In Proceedings of the 2012 ACM International Workshop on Smart Health and Wellbeing, pp. 25-32.
Bunescu R. C., Mooney R. J. (2005). A shortest path dependency kernel for relation extraction. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 724-731.
Chee B. W., Berlin R., & Schatz B. (2011). Predicting adverse drug events from personal health messages. In AMIA Annual Symposium Proceedings, Vol. 2011, pp. 217-226.
Culotta, A., & Sorensen, J. (2004). Dependency tree kernels for relation extraction. In Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics, pp. 423-429.
Leaman R., Wojtulewicz L., Sullivan R., Skariah A., Yang J., Gonzalez G. (2010). Towards internet-age pharmacovigilance: Extracting adverse drug reactions from user posts to health-related social networks. In Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, ACL, pp. 117-125.
Liu, X., & Chen, H. (2013). AZDrugMiner: an information extraction system for mining patient-reported adverse drug events in online patient forums. In Smart Health, Springer Berlin Heidelberg, pp. 134-150.
Yang C. C., Yang H., Jiang L., & Zhang M. (2012). Social media mining for drug safety signal detection. In Proceedings of the 2012 International Workshop on Smart Health and Wellbeing, ACM, pp. 33-40.
Zelenko D., Aone C. and Richardella A. (2003). Kernel methods for relation extraction. Journal of Machine Learning Research, 3, pp. 1083-1106.
