1 Text Mining Review Slides

The document provides an overview of text mining techniques, tools, ontologies, and shared tasks. It discusses common text mining techniques like text classification, sentiment analysis, topic modeling, named entity recognition, and entity relation extraction. It also outlines popular open source text mining tools and ontologies used in text mining applications.

Uploaded by

Walid_Sassi_Tun


TEXT MINING: TECHNIQUES, TOOLS, ONTOLOGIES AND SHARED TASKS
Spring 2014

Introduction
Text mining, also referred to as text data mining, is the process of deriving high-quality information from text.
Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics and computational linguistics.
Text mining techniques have been applied in a large number of areas, such as business intelligence, national security, scientific discovery (especially in the life sciences) and social media monitoring.

Introduction
In this set of slides, we are going to cover:
- The most commonly used text mining techniques
- Ontologies that are often used in text mining
- Open source text mining tools
- Shared tasks in text mining, which reflect the hot topics in this area
- A research case which applies text mining techniques to solve a healthcare-related problem with social media data

TEXT MINING TECHNIQUES
- Text Classification
- Sentiment Analysis
- Topic Modeling
- Named Entity Recognition
- Entity Relation Extraction

Text Classification
Text classification, or text categorization, is a problem in library science, information science, and computer science. It is the task of choosing the correct class label for a given input.
Some examples of text classification tasks are:
- Deciding whether an email is spam or not (spam detection).
- Deciding what the topic of a news article is, from a fixed list of topic areas such as sports, technology, and politics (document classification).
- Deciding whether a given occurrence of the word "bank" is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution (word sense disambiguation).

Text Classification

Text classification is a supervised machine learning task: the classifier is built from training corpora containing the correct label for each input. The framework for classification is shown in the figure below.

(a) During training, a feature extractor is used to convert each input value to a feature set. These feature sets, which capture the basic information about each input that should be used to classify it, are discussed in the next section. Pairs of feature sets and labels are fed into the machine learning algorithm to generate a model.
(b) During prediction, the same feature extractor is used to convert unseen inputs to feature sets. These feature sets are then fed into the model, which generates predicted labels.
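The training and prediction steps described above can be sketched with a small naive Bayes classifier. This is a minimal illustration under stated assumptions, not the implementation used by any of the tools in these slides: the function names, the whitespace tokenizer, the add-one smoothing and the four-message training set are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def extract_features(text):
    """Feature extractor: lowercase bag-of-words counts (whitespace tokenizer)."""
    return Counter(text.lower().split())


def train(labeled_texts):
    """(a) Training: convert (text, label) pairs into naive Bayes count tables."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_texts:
        features = extract_features(text)
        label_counts[label] += 1
        word_counts[label].update(features)
        vocabulary.update(features)
    return label_counts, word_counts, vocabulary


def predict(model, text):
    """(b) Prediction: same feature extractor, then pick the most probable label."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    features = extract_features(text)
    best_label, best_score = None, float("-inf")
    for label, doc_count in label_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word, count in features.items():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            score += count * math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training corpus.
training_data = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]
model = train(training_data)
print(predict(model, "claim your free prize"))
print(predict(model, "agenda for the project meeting"))
```

The same `extract_features` function is used in both phases, which is the key point of the framework: the model only ever sees feature sets, never raw text.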

Text Classification
Common features for text classification include bag-of-words (BOW), bigrams, trigrams and part-of-speech (POS) tags for each word in the document.
The most commonly adopted machine learning algorithms for text classification are naive Bayes, support vector machines, and maximum entropy classification.

| Algorithm | Language | Tools |
|---|---|---|
| Naive Bayes | Java | Weka, Mahout, Mallet |
| Naive Bayes | Python | NLTK |
| Support Vector Machines | C++ | SVM-light, mySVM, LibSVM |
| Support Vector Machines | MatLab | SVM Toolbox |
| Support Vector Machines | Java | Weka |
| Maximum entropy | Java | Mallet |
| Maximum entropy | Python | NLTK |
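As a rough illustration of the bag-of-words, bigram and trigram features mentioned on this slide, the sketch below counts word n-grams in a document. The function names and the `name=value` feature-key format are assumptions made for this example; POS-tag features are left out because they would require a tagger.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(text):
    """Count unigram (bag-of-words), bigram and trigram features."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    features = {}
    for n, name in ((1, "unigram"), (2, "bigram"), (3, "trigram")):
        for gram in ngrams(tokens, n):
            key = f"{name}={' '.join(gram)}"
            features[key] = features.get(key, 0) + 1
    return features


features = extract_features("the cat sat on the mat")
for key in sorted(features):
    print(key, features[key])
```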

Sentiment Analysis
Sentiment analysis (also known as opinion mining) refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source material.
The rise of social media such as forums, microblogs and blogs has fueled interest in sentiment analysis.
- Online reviews, ratings and recommendations on social media sites have turned into a kind of virtual currency for businesses looking to market their products, identify new opportunities and manage their reputations.
- As businesses look to automate the process of filtering out the noise, identifying relevant content and understanding reviewers' opinions, sentiment analysis is the right technique.

Sentiment Analysis

The main tasks, their descriptions and approaches are summarized in the table below:

| Task | Description | Approaches | Lexicons/Algorithms |
|---|---|---|---|
| Polarity Classification | Classifying a given text at the document, sentence, or feature/aspect level into positive, negative or neutral | Lexicon-based scoring; machine learning classification | SentiWordNet, LIWC; SVM |
| Affect Analysis | Classifying a given text into affect states such as "angry", "sad", and "happy" | Lexicon-based scoring; machine learning classification | WordNet-Affect; SVM |
| Subjectivity Analysis | Classifying a given text into two classes: objective and subjective | Lexicon-based scoring; machine learning classification | SentiWordNet, LIWC; SVM |
| Feature/Aspect Based Analysis | Determining the opinions or sentiment expressed on different features or aspects of entities (e.g., the screen [feature] of a cell phone [entity]) | Named entity recognition + entity relation detection | SentiWordNet, LIWC, WordNet; SVM |
| Opinion | Detecting the holder of a sentiment (i.e. | Named entity | SentiWordNet, |
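Lexicon-based scoring, listed above as one approach to polarity classification, can be sketched as follows. The tiny lexicon, its scores and the zero threshold are hypothetical stand-ins for a real resource such as SentiWordNet or LIWC; the sketch does not handle negation or intensifiers.

```python
# Hypothetical toy lexicon: word -> polarity score in [-1, 1].
TOY_LEXICON = {
    "good": 1.0, "great": 1.0, "love": 1.0,
    "bad": -1.0, "poor": -1.0, "terrible": -1.0,
}


def polarity(text, lexicon=TOY_LEXICON):
    """Sum the lexicon scores of all tokens and map the total to a class."""
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    score = sum(lexicon.get(token, 0.0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(polarity("The screen is great and I love it!"))
print(polarity("Terrible battery, poor support."))
print(polarity("It arrived on Monday."))
```

A machine learning classifier such as an SVM would replace the fixed lexicon with weights learned from labeled examples.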

Topic Modeling

Topic models are a suite of algorithms for discovering the main themes that pervade a large and otherwise unstructured collection of documents.

Topic modeling algorithms include Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Indexing (PLSI), and Latent Dirichlet Allocation (LDA).
- Among them, Latent Dirichlet Allocation (LDA) is the most commonly used nowadays.

Topic modeling algorithms can be applied to massive collections of documents.
- Recent advances in this field allow us to analyze streaming collections, like those you might get from a Web API.

Topic modeling algorithms can be adapted to many kinds of

Topic Modeling - LDA

The figure below shows the intuitions behind latent Dirichlet allocation. We assume that some number of topics, which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic.
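The generative story above can be written down directly. The sketch below only generates a synthetic document; it does not perform inference. The two toy topics, the Dirichlet parameter `alpha` and all function names are assumptions made for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, num_words, rng):
    """Generate one document following the LDA generative process."""
    # Step 1: choose the document's distribution over topics.
    theta = sample_dirichlet(alpha, rng)
    assignments, words = [], []
    for _ in range(num_words):
        # Step 2a: choose a topic assignment for this word position.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Step 2b: choose the word from the corresponding topic.
        vocabulary, probabilities = zip(*topics[z].items())
        w = rng.choices(vocabulary, weights=probabilities)[0]
        assignments.append(z)
        words.append(w)
    return theta, assignments, words


# Hypothetical topics: each is a distribution over words.
TOPICS = [
    {"gene": 0.5, "dna": 0.3, "genetic": 0.2},
    {"data": 0.5, "computer": 0.3, "number": 0.2},
]

rng = random.Random(0)
theta, assignments, words = generate_document(TOPICS, alpha=[0.5, 0.5], num_words=8, rng=rng)
print("topic proportions:", theta)
print("document:", " ".join(words))
```

Fitting LDA to real text runs this story in reverse: given only the words, it infers the topics and the per-document proportions.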

Topic Modeling - LDA

The figure below shows real inference with LDA. A 100-topic LDA model was fitted to 17,000 articles from the journal Science. At left are the inferred topic proportions for the example article in the previous figure. At right are the top 15 most frequent words from the most frequent topics found in this article.

Topic Modeling - Tools

| Name | Model/Algorithm | Language | Author | Notes |
|---|---|---|---|---|
| lda-c | Latent Dirichlet allocation | C | D. Blei | Implements variational inference for LDA. |
| class-slda | Supervised topic models for classification | C++ | C. Wang | Implements supervised topic models with a categorical response. |
| lda | R package for Gibbs sampling in many models | R | J. Chang | Implements many models and is fast. Supports LDA, RTMs (for networked documents), MMSB (for network data), and sLDA (with a continuous response). |
| tmve | Topic Model Visualization Engine | Python | A. Chaney | A package for creating corpus browsers. |
| dtm | Dynamic topic models and the influence model | C++ | S. Gerrish | Implements topics that change over time and a model of how individual documents predict that change. |
| ctm-c | Correlated topic models | C | D. Blei | Implements variational inference for the CTM. |
| Mallet | LDA, Hierarchical LDA | Java | A. McCallum | Implements LDA and Hierarchical LDA. |
| Stanford Topic Modeling Toolbox | LDA, Labeled LDA, Partially Labeled LDA | Java | Stanford NLP Group | Implements LDA, Labeled LDA, and PLDA. |

Named Entity Recognition

A named entity is anything that can be referred to with a proper name.
Named entity recognition aims to:
- Find spans of text that constitute proper names
- Classify the entities being referred to according to their type

| Type | Sample Categories | Example |
|---|---|---|
| People | Individuals, fictional characters | Turing is often considered to be the father of modern computer science. |
| Organization | Companies, parties | Amazon plans to use drone copters for deliveries. |
| Location | Mountains, lakes, seas | The highest point in the Catalinas is Mount Lemmon at an elevation of 9,157 feet above sea level. |
| Geo-Political | Countries, states, provinces | The Catalinas are located north and northeast of Tucson, Arizona, United States. |
| Facility | Bridges, airports | In the late 1940s, Chicago Midway was the busiest airport in the United States by total aircraft operations. |
| Vehicles | Planes, trains, cars | The updated Mini Cooper retains its charm and agility. |

In practice, named entity recognition can be extended to types that are not in the table above, such as temporal expressions (times and dates), genes, proteins, and medical concepts (diseases, treatments and medical events).

Named Entity Recognition

Named entity recognition techniques can be categorized into knowledge-based approaches and machine learning based approaches.

| Category | Advantage | Disadvantage | Tools/Ontology |
|---|---|---|---|
| Knowledge-based approach | Requires little training data | Creating lexicons manually is time-consuming and expensive; encoded knowledge might not be portable across domains. | General entity types: WordNet, lexicons created by experts. Medical domain: GATE (University of Sheffield), UMLS (National Library of Medicine), MedLEE (originally from Columbia University, commercialized now) |
| Machine learning approach: Conditional Random Field (CRF), Hidden Markov Model (HMM) | Reduced human effort in maintaining rules and dictionaries | Requires a prepared set of annotated training data | Conditional Random Field tools: Stanford NER, CRF++, Mallet. Hidden Markov Model tools: Mallet, Natural Language Toolkit (NLTK) |
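A knowledge-based recognizer can be as simple as a dictionary (gazetteer) lookup. The sketch below is a minimal illustration: the gazetteer entries reuse names from the example sentences on the previous slide, while the function name and the longest-match-first rule are assumptions of this example.

```python
import re

# Hypothetical miniature gazetteer: proper name -> entity type.
GAZETTEER = {
    "Mount Lemmon": "Location",
    "Tucson": "Geo-Political",
    "Arizona": "Geo-Political",
    "Amazon": "Organization",
    "Turing": "People",
}


def find_entities(text, gazetteer=GAZETTEER):
    """Return (start, end, name, type) for every gazetteer name found in text."""
    # Longer names go first so multi-word names win over shorter overlaps.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return [
        (m.start(), m.end(), m.group(1), gazetteer[m.group(1)])
        for m in pattern.finditer(text)
    ]


for entity in find_entities("Mount Lemmon is near Tucson, Arizona."):
    print(entity)
```

The sketch also shows the disadvantage from the table: a lookup cannot tell Amazon the company from the Amazon river, and every new domain needs a new hand-built dictionary.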

Entity Relation Extraction

Entity relation extraction discerns the relationships that exist among the entities detected in a text. Entity relation extraction techniques are applied in a variety of areas.
- Question Answering
  - Extracting entities and relational patterns for answering factoid questions
- Feature/Aspect based Sentiment Analysis
  - Extracting relational patterns among entities, features and sentiments in text: R(entity, feature, sentiment)
- Mining biomedical texts
  - Protein binding relations useful for drug discovery
  - Detection of gene-disease relations from biomedical literature
  - Finding drug-side effect relations in health social media

Entity Relation Extraction

Entity relation extraction approaches can be categorized into three types:

| Category | Method | Advantage | Disadvantage | Tools |
|---|---|---|---|---|
| Co-occurrence Analysis | If two entities co-occur within a certain distance, they are considered to have a relation | Simplicity and flexibility; high recall | Low precision; can't decide relation types | |
| Rule-based approaches | Create rules for relation extraction based on syntactic and semantic information in the sentences | General, flexible | Lower portability across different domains; manual encoding of syntactic and semantic rules | Syntactic information: Stanford Parser; OpenNLP. Semantic information: domain knowledge bases |
| Supervised Learning | Feature-based methods: feature representation. Kernel-based methods: | Little or no manual development of rules and templates | Annotated corpora are required. | Dan Bikel's parser; MST parser; Stanford parser; SVM classifier: |
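Co-occurrence analysis, the first row of the table, can be sketched in a few lines. The entity positions are assumed to come from an earlier named entity recognition step; the function name, the token-distance measure and the example sentence are hypothetical.

```python
def cooccurring_pairs(entities, max_distance):
    """Return entity pairs whose token positions are at most max_distance apart.

    entities: list of (token_index, name) tuples in text order.
    """
    pairs = []
    for i in range(len(entities)):
        for j in range(i + 1, len(entities)):
            (pos_a, name_a), (pos_b, name_b) = entities[i], entities[j]
            if abs(pos_b - pos_a) <= max_distance:
                pairs.append((name_a, name_b))
    return pairs


# Token positions in the hypothetical sentence:
# "aspirin may cause nausea but ibuprofen was well tolerated
#  by most patients without headache"
entities = [(0, "aspirin"), (3, "nausea"), (5, "ibuprofen"), (13, "headache")]
print(cooccurring_pairs(entities, max_distance=5))
```

The output pairs "nausea" with "ibuprofen" even though the sentence says the opposite, which illustrates the low precision noted in the table, and nothing in the output says what kind of relation holds.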

Supervised Learning Approaches for Entity Relation Extraction
The supervised learning approach breaks relation extraction into two subtasks: relation detection and relation classification. Each subtask is a text classification problem.
- Classifier 1: Detect when a relation is present between two entities
- Classifier 2: Classify the relation types

Supervised learning approaches can be categorized into feature-based methods and kernel-based methods.

[Figure: Sentences -> Text Analysis (POS, Parse Trees) -> Feature Extraction (feature-based methods) or Kernel Function (kernel-based methods) -> Classifier]

Supervised Learning Approach to Entity Relation Extraction
Feature-based methods rely on features to represent instances for classification. The features for relation extraction can be categorized into:

Entity-based features
- Entity types of the two candidate arguments
- Concatenation of the two entity types
- Headwords
- Bag-of-words from the arguments

Word-based features
- Bag-of-words and bag-of-bigrams between entities
- Stemmed versions of bag-of-words and bag-of-bigrams between entities
- Words and stems immediately preceding and following the entities
- Distance in words between the arguments
- Number of entities between the arguments

Syntactic features
- Presence of particular constructions in a constituent structure
- Chunk base-phrase paths
- Bags of chunk heads
- Dependency-tree paths
- Constituent-tree paths
- Tree distance between the arguments
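A few of the entity-based and word-based features listed above can be computed without a parser. The sketch below is an illustration only: the span format `(start, end, entity_type)`, the feature names and the "last token is the headword" shortcut are assumptions, and the syntactic features are omitted because they need parse trees.

```python
def relation_features(tokens, arg1, arg2):
    """Build a feature dict for a candidate relation between two arguments.

    arg1 and arg2 are (start, end, entity_type) token spans, end exclusive,
    with arg1 appearing before arg2 in the sentence.
    """
    s1, e1, type1 = arg1
    s2, e2, type2 = arg2
    between = [t.lower() for t in tokens[e1:s2]]
    features = {
        # Entity-based features
        "type1": type1,
        "type2": type2,
        "type_pair": f"{type1}-{type2}",
        "head1": tokens[e1 - 1].lower(),  # crude headword: last token (assumption)
        "head2": tokens[e2 - 1].lower(),
        # Word-based features
        "distance": len(between),
        "word_before_arg1": tokens[s1 - 1].lower() if s1 > 0 else "<s>",
        "word_after_arg2": tokens[e2].lower() if e2 < len(tokens) else "</s>",
    }
    for word in between:
        features[f"between={word}"] = True
    for first, second in zip(between, between[1:]):
        features[f"between_bigram={first} {second}"] = True
    return features


tokens = "Amazon plans to open an office in Tucson next year".split()
features = relation_features(tokens, (0, 1, "Organization"), (7, 8, "Geo-Political"))
for name in sorted(features):
    print(name, features[name])
```

A feature dict like this one is what Classifier 1 and Classifier 2 would consume in the two-stage setup described earlier.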

Supervised Learning Approach to Entity Relation Extraction
Kernel-based methods are an effective alternative to explicit feature extraction.
- They retain the original representation of objects and use the objects only by computing a kernel function between a pair of objects.

A kernel K(x,y) defines the similarity between objects x and y implicitly in a higher dimensional space.

Commonly used kernel functions for relation extraction:

| Author | Kernel | Description | Node attributes |
|---|---|---|---|
| Zelenko et al. 2003 | Shallow parse tree kernel | Uses shallow parse trees | Entity type, word, POS tag |
| Culotta et al. 2004 | Dependency tree kernel | Uses dependency parse trees | Word, POS, generalized POS, chunk tag, entity type, entity level |
| Bunescu et al. 2005 | Shortest dependency path kernel | Uses the shortest path between entities in a dependency tree | Word, POS, generalized POS, entity type |
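To make the idea of a kernel over structured objects concrete, the sketch below implements a simplified kernel in the spirit of the shortest dependency path kernel from the table: two paths of different length have similarity 0, and otherwise the similarity is the product, over positions, of the number of features the two paths share at that position. The path encoding and the example feature sets are assumptions of this illustration; a real system would obtain the paths from a dependency parser.

```python
def path_kernel(path_x, path_y):
    """Similarity between two dependency paths.

    Each path is a list of feature sets, one set per position on the path
    (words alternate with dependency-direction markers).
    """
    if len(path_x) != len(path_y):
        return 0
    similarity = 1
    for features_x, features_y in zip(path_x, path_y):
        # Count the features the two paths share at this position.
        similarity *= len(features_x & features_y)
    return similarity


# Hypothetical paths for "his actions in Brcko" and "his arrival in Beijing".
x = [
    {"his", "PRP", "PERSON"}, {"->"},
    {"actions", "NNS", "Noun"}, {"<-"},
    {"in", "IN"}, {"<-"},
    {"Brcko", "NNP", "Noun", "LOCATION"},
]
y = [
    {"his", "PRP", "PERSON"}, {"->"},
    {"arrival", "NN", "Noun"}, {"<-"},
    {"in", "IN"}, {"<-"},
    {"Beijing", "NNP", "Noun", "LOCATION"},
]
print(path_kernel(x, y))
```

No explicit feature vector is ever built: the classifier (for example an SVM) only needs the pairwise values K(x, y).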

Ontology

An ontology represents knowledge as a set of concepts within a domain, using a shared vocabulary to denote the types, properties, and interrelationships of those concepts.
In text mining, ontologies are often used to extract named entities, detect entity relations and conduct sentiment analysis.
Commonly used ontologies are listed in the table below:

| Name | Creator | Description | Application |
|---|---|---|---|
| WordNet | Princeton University | A large lexical database of English. | Word sense disambiguation; text summarization; text similarity analysis |
| SentiWordNet | Andrea Esuli, Fabrizio Sebastiani | SentiWordNet is a lexical resource for opinion mining. | Sentiment analysis |
| Linguistic Inquiry and Word Count (LIWC) | James W. Pennebaker, Roger J. Booth, Martha E. Francis | LIWC is a lexical resource for sentiment analysis. | Sentiment analysis; affect analysis; deception detection |
| Unified Medical Language System (UMLS) | US National Library of Medicine | The Unified Medical Language System (UMLS) is a compendium of many controlled vocabularies in the biomedical sciences. | Medical entity recognition |
| MedEffect | Canadian Adverse Drug Reaction Monitoring Program (CADRMP) | A knowledge base about drugs and side effects in Canada | Medical entity recognition; drug safety surveillance |
| Consumer Health Vocabulary (CHV) | University of Utah | Mapping consumer health vocabulary to standard medical terms in UMLS. | Medical entity recognition; health social media analytics |
| FDA's Adverse Event Reporting System (FAERS) | United States Food and Drug Administration | Documenting adverse drug event reports and drug indications of all the medical products in the US market | Medical entity recognition |

WordNet
WordNet is an online lexical database in which English nouns, verbs, adjectives and adverbs are organized into sets of synonyms.
- Each synonym set represents a lexicalized concept.
- Semantic relations link the synonym sets (synsets).

WordNet contains more than 118,000 different word forms and more than 90,000 senses.
- Approximately 17% of the words in WordNet are polysemous (have more than one sense); 40%

WordNet
Six semantic relations are presented in WordNet because they apply broadly throughout English and because a user need not have advanced training in linguistics to understand them. The table below shows the included semantic relations.

| Semantic Relation | Syntactic Category | Examples |
|---|---|---|
| Synonymy (similar) | Noun, Verb, Adjective, Adverb | pipe, tube; rise, ascend; sad, unhappy; rapidly, speedily |
| Antonymy (opposite) | Adjective, Adverb | wet, dry; powerful, powerless; rapidly, slowly |
| Hyponymy (subordinate) | Noun | maple, tree; tree, plant |
| Meronymy (part) | Noun | brim, hat; ship, fleet |
| Troponymy (manner) | Verb | march, walk; whisper, speak |
| Entailment | Verb | drive, ride; divorce, marry |

WordNet has been used for a number of different purposes in information systems, including word sense disambiguation, information retrieval, text classification, text summarization, machine translation and semantic textual similarity analysis.

SentiWordNet

SentiWordNet is a lexical resource explicitly devised for supporting sentiment analysis and opinion mining applications.
SentiWordNet is the result of the automatic annotation of all the synsets of WordNet according to the notions of positivity, negativity and objectivity.
- Each of the positivity, negativity and objectivity scores ranges in the interval [0.0, 1.0], and their sum is 1.0 for each synset.

The figure above shows the graphical representation adopted by SentiWordNet for representing the opinion-related properties of a term sense.
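Because the three scores of a synset sum to 1.0, storing positivity and negativity is enough to recover objectivity. The sketch below shows that constraint; the class name and the example scores are hypothetical and are not taken from the SentiWordNet data files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SenseScores:
    """Opinion-related scores of one synset (hypothetical container)."""
    positivity: float
    negativity: float

    def __post_init__(self):
        # Each score lies in [0.0, 1.0] and the three scores sum to 1.0,
        # so positivity + negativity can never exceed 1.0.
        if not (0.0 <= self.positivity <= 1.0 and 0.0 <= self.negativity <= 1.0):
            raise ValueError("scores must lie in [0.0, 1.0]")
        if self.positivity + self.negativity > 1.0:
            raise ValueError("positivity + negativity must not exceed 1.0")

    @property
    def objectivity(self):
        """Objectivity is whatever remains after positivity and negativity."""
        return 1.0 - (self.positivity + self.negativity)


sense = SenseScores(positivity=0.75, negativity=0.0)  # illustrative values
print(sense.objectivity)
```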

SentiWordNet
In SentiWordNet, different senses of the same term may have different opinion-related properties.

[Figure: SentiWordNet search results for one term, with labels for the search term, Sense 1, Sense 2, Sense 3, the positivity, objectivity and negativity scores, and a synonym of "estimable" in this sense]

The figure above shows the visualization of the opinion-related properties of the term "estimable" in SentiWordNet (http://sentiwordnet.isti.cnr.it/search.php?q=estimable).

Linguistic Inquiry and Word Count (LIWC)
Linguistic Inquiry and Word Count (LIWC) is a text analysis program that looks for and counts words in psychology-relevant categories across text files.
- Empirical results using LIWC demonstrate its ability to detect meaning in a wide variety of experimental settings, including showing attentional focus, emotionality, social relationships, thinking styles, and individual differences.

LIWC is often adopted in NLP applications for sentiment analysis, affect analysis and deception detection.

Linguistic Inquiry and Word Count (LIWC)
The LIWC program has two major components: the processing component and the dictionaries.
- Processing
  - Opens a series of text files (posts, blogs, essays, novels, and so on)
  - Compares each word in a given text with the dictionary file
- Dictionaries: the collections of words that define a particular category
  - English dictionary: over 100,000 words across over 80 categories, examined by human experts
  - Major categories: function words, social processes, affective processes, positive emotion, negative emotion, cognitive processes, biological processes, relativity, etc.
  - Multilingual: Arabic, Chinese, Dutch, French, German, Italian, Portuguese, Russian, Serbian, Spanish and Turkish
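The processing component described above, comparing each word of a text with the category dictionaries, can be sketched as below. The three-category dictionary is a hypothetical toy, not the LIWC dictionary, and reporting results as a percentage of all words is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical toy dictionary: category -> words that define it.
TOY_DICTIONARY = {
    "positive emotion": {"happy", "love", "nice"},
    "negative emotion": {"sad", "hurt", "worried"},
    "social processes": {"friend", "family", "talk"},
}


def category_percentages(text, dictionary=TOY_DICTIONARY):
    """Return, per category, the percentage of words that fall in it."""
    tokens = [t.strip(".,!?;:\"'").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    counts = Counter()
    for token in tokens:
        # A word may belong to several categories, so check all of them.
        for category, words in dictionary.items():
            if token in words:
                counts[category] += 1
    if not tokens:
        return {category: 0.0 for category in dictionary}
    return {category: 100.0 * counts[category] / len(tokens) for category in dictionary}


print(category_percentages("I love my family but I am worried"))
```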

Linguistic Inquiry and Word Count (LIWC)
[Figure: LIWC online demo output, with labels for the LIWC categories, the LIWC results from the input text, and the LIWC results from personal text and formal writing for comparison]

Input text: A post from a 40-year-old female member of the American Diabetes Association online community

LIWC online demo:

Unified Medical Language System (UMLS)
The Unified Medical Language System (UMLS) is a repository of biomedical vocabularies developed by the US National Library of Medicine.
- UMLS integrates over 2.5 million names for 900,551 concepts from more than 60 families of biomedical vocabularies, as well as 12 million relations among these concepts.
- Ontologies integrated in the UMLS Metathesaurus include the NCBI taxonomy, Gene Ontology (GO), the Medical Subject Headings (MeSH), Online Mendelian Inheritance in Man (OMIM), the University of Washington Digital Anatomist symbolic knowledge base (UWDA) and the Systematized Nomenclature of Medicine--Clinical Terms (SNOMED CT).

Unified Medical Language System (UMLS)
Major ontologies integrated in UMLS

| Name | Creator | Description | Application |
|---|---|---|---|
| National Center for Biotechnology Information (NCBI) Taxonomy | National Library of Medicine | All of the organisms in the public sequence databases | Identify organisms |
| University of Washington Digital Anatomist Source Information (UWDA) | University of Washington Structural Informatics Group | Symbolic models of the structures and relationships that constitute the human body | Identify terms in anatomy |
| Gene Ontology (GO) | Gene Ontology Consortium | Gene product characteristics and gene product annotation data | Gene product annotation |
| Medical Subject Headings (MeSH) | National Library of Medicine | Vocabulary thesaurus used for indexing articles for PubMed | Cover terms in biomedical literature |
| Online Mendelian Inheritance in Man (OMIM) | McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University | Human genes and genetic phenotypes | Annotate human genes |
| Systematized Nomenclature of Medicine--Clinical Terms (SNOMED CT) | College of American Pathologists | Comprehensive, multilingual clinical healthcare terminology | Identify clinical terms |

Unified Medical Language System (UMLS)
Accessing UMLS data
- No fee associated, license agreement required
- Available for research purposes, restrictions apply for other kinds of applications

UMLS related tools
- MetamorphoSys (command line program)
  - UMLS installation wizard and customization tool
  - Selecting concepts from a given sub-domain
  - Selecting the preferred name of concepts
- MetaMap (Java)
  - Extracts UMLS concepts from text
  - Variable length of input text
  - Outputs a ranked list of UMLS concepts associated with the input text

MedEffect
MedEffect is the Canada Vigilance Adverse Reaction Online Database, which contains information about suspected adverse reactions to health products.
- Reports are submitted by consumers and health professionals.
- It contains a complete list of medications, adverse reactions and drug indications (medical conditions for the legitimate use of a medication).

MedEffect is often used in healthcare research for annotating medications and adverse reactions from text (Leaman et al. 2010; Chee et al. 2011).

Consumer Health Vocabulary (CHV)
Consumer Health Vocabulary (CHV) is a lexicon linking UMLS standard medical terms to health consumer vocabulary.
- Laypeople use a different vocabulary from healthcare professionals to describe medical problems.
- CHV helps to bridge the communication gap between consumers and healthcare professionals by mapping the UMLS standard medical terms to consumer health language.

It has been applied in prior studies to better understand and match user expressions for medical entity extraction in social media (Yang et al. 2012; Benton et al. 2011).

FDA's Adverse Event Reporting System (FAERS)
FDA's Adverse Event Reporting System (FAERS) documents adverse drug event reports and drug indications of all the medical products in the US market.
- Reports are submitted by consumers, health professionals, pharmaceutical companies and researchers.
- It contains a complete list of medical products in the United States and their suspected adverse reactions.

FAERS has been applied in healthcare research for medical named entity recognition and adverse drug event extraction (Bian et al. 2012; Liu et al. 2013).

A-Z LIST OF OPEN SOURCE NLP TOOLKITS

| Name | Main Features | Language | Creators | Website |
|---|---|---|---|---|
| Antelope framework | Part-of-speech tagging, dependency parsing, WordNet lexicon | C#, VB.net | Proxem | [1] |
| Apertium | Machine translation for language pairs from Spanish, English, French, Portuguese, Catalan and Occitan | C++, Java | (various) | [2] |
| ClearTK | Wrappers for machine learning libraries (SVMlight, LibSVM, OpenNLP MaxEnt) and NLP tools (Snowball Stemmer, OpenNLP, Stanford CoreNLP) | Java | The Center for Computational Language and Education Research at the University of Colorado Boulder | [3] |
| cTakes | Sentence boundary detection, tokenization, normalization, POS tagging, chunking, context (family history, symptoms, disease, disorders, procedures) annotator, negation detection, dependency parsing, drug mention annotator | Java | Children's Hospital Boston, Mayo Clinic | [4] |
| DELPH-IN | Deep linguistic analysis: head-driven phrase structure grammar (HPSG) and minimal recursion semantics parsing | LISP, C++ | Deep Linguistic Processing with HPSG Initiative | [5] |
| Factorie | Scalable NLP toolkit for named entity recognition, relation extraction, parsing, pattern matching, and topic modeling (LDA) | Java | University of Massachusetts Amherst | [6] |
| FreeLing | Tokenization, sentence splitting, contraction splitting, morphological analysis, named entity recognition, POS tagging, dependency parsing, coreference resolution | C++ | Universitat Politècnica de Catalunya | [7] |
| General Architecture for Text Engineering (GATE) | Information extraction (tokenization, sentence splitter, POS tagger, named entity recognition, coreference resolution), machine learning library wrapper (Weka, MaxEnt, SVMLight, RASP, LibSVM), ontology (WordNet) | Java | GATE open source community | [8] |
| Graph Expression | Information extraction (named entity recognition, relation and fact extraction, parsing and search problem solving) | Java | Startup huti.ru | [9] |

| Name | Main Features | Language | Creators | Website |
|---|---|---|---|---|
| Learning Based Java | POS tagger, chunking, coreference resolution, named entity recognition | Java | Cognitive Computation Group at UIUC | [10] |
| LingPipe | Topic classification, named entity recognition, clustering, POS tagging, spelling correction, sentiment analysis, logistic regression, word sense disambiguation | Java | Alias-i | [11] |
| Mahout | Scalable machine learning libraries (logistic regression, Naive Bayes, Random Forest, HMM, SVM, Neural Network, Boosting, K-means, Fuzzy K-means, LDA, Expectation Maximization, PCA) | Java | Online community | [12] |
| Mallet | Document classification (Naive Bayes, Maximum Entropy, decision trees), sequence tagging (HMM, MEMM, CRF), topic modeling (LDA, Hierarchical LDA) | Java | University of Massachusetts Amherst | [13] |
| MetaMap | Maps biomedical text to the UMLS Metathesaurus and discovers Metathesaurus concepts referred to in text | Java | National Library of Medicine | [14] |
| MII nlp toolkit | De-identification tools for free-text medical reports | Java | UCLA Medical Imaging Informatics (MII) Group | [15] |
| MontyLingua | Tokenization, POS tagging, chunking, extractors for phrases and subject/verb/object tuples from sentences, morphological analysis, text summarization | Python, Java | MIT | [16] |
| Natural Language Toolkit (NLTK) | Interface to over 50 open access corpora, lexicon resources such as WordNet, text processing libraries for classification, tokenization, stemming, POS tagging, parsing and semantic reasoning | Python | Online community | [17] |
| NooJ (based on INTEX) | Morphological analysis, syntactic parsing, named entity recognition | .NET Framework-based | University of Franche-Comté, France | [18] |

| Name | Main Features | Language | Creators | Website |
|---|---|---|---|---|
| OpenNLP | Tokenization, sentence segmentation, POS tagging, named entity extraction, chunking, parsing, coreference resolution | Java | Online community | [19] |
| Pattern | Wrapper for Google, Twitter and Wikipedia API, web crawler, HTML DOM parsing, POS tagging, n-gram search, sentiment analysis, WordNet, machine learning algorithms for clustering and classification, network analysis and visualization | Python | Tom De Smedt, CLiPS, University of Antwerp | [20] |
| PSI-Toolkit | Text preprocessing, sentence splitting, tokenization, lexical and morphological analysis, syntactic/semantic parsing, machine translation | C++ | Adam Mickiewicz University in Poznań | [21] |
| ScalaNLP | Tokenization, POS tagging, sentence segmentation, sequence tagging (CRF, HMM), machine learning algorithms (linear regression, Naive Bayes, SVM, K-Means, LDA, Neural Network) | Scala | David Hall and Daniel Ramage | [22] |
| Stanford NLP | Tokenization, POS tagging, named entity recognition, parsing, coreference, topic modeling, classification (Naive Bayes, logistic regression, maximum entropy), sequence tagging (CRF) | Java | The Stanford Natural Language Processing Group | [23] |
| Rasp | Tokenization, POS tagging, lemmatization, parsing | C++ | University of Cambridge, University of Sussex | [24] |
| Natural | Tokenization, stemming, classification (Naive Bayes, logistic regression), morphological analysis, WordNet | JavaScript, NodeJs | Chris Umbel | [25] |
| Text Engineering Software Laboratory (Tesla) | Tokenization, POS tagging, sequence alignment | Java | University of Cologne | [26] |
| Treex | Machine translation | Perl | Charles University in Prague | [27] |

| Name | Main Features | Language | Creators | Website |
|---|---|---|---|---|
| UIMA | Industry standard for content analytics, contains a set of rule-based and machine learning annotators and tools | Java/C++ | Apache | [28] |
| VisualText | Tokenization, POS tagging, named entity recognition, classification, text summarization | NLP++ / compiles to C++ | Text Analysis International, Inc | [29] |
| WebLab-project | Language identification, named entity recognition, semantic analysis, relation extraction, text classification and clustering, text summarization | Java/C++ | OW2 | [30] |
| UniteX | Tokenization, sentence boundary detection, parsing, morphological analysis, rule-based named entity recognition, text alignment, word sense disambiguation | Java & C++ | Laboratoire d'Automatique Documentaire et Linguistique | [31] |
| The Dragon Toolkit | Tools for accessing PubMed, TREC collection, NewsGroup articles, Reuters articles, and Google Search Engine, ontologies (UMLS, WordNet, MeSH), tokenization, stemming, POS tagging, named entity recognition, classification (Naive Bayes, SVM-light, LibSVM, logistic regression), clustering (K-Means, hierarchical clustering), topic modeling (LDA), text summarization | Java | Drexel University | [32] |
| Text Extraction, Annotation and Retrieval Toolkit | Tokenization, chunking, sentence segmenting, parsing, ontology (WordNet), topic modeling (LDA), named entity recognition, stemming, machine learning algorithms (decision tree, SVM, neural network) | Ruby | Louis Mullie | [33] |
| Zhihuita NLP API | Chinese text segmentation, spelling checking, pattern matching | | Zhihuita.org | [34] |

SHARED TASKS (COMPETITIONS) IN HEALTHCARE AND NATURAL LANGUAGE PROCESSING DOMAINS

Introduction
Shared task series in Natural Language Processing often represent community-wide trends and hot topics that have not been fully explored in the past.
To keep up with the state-of-the-art techniques and new research topics in the NLP community, we explore major conferences, workshops and special interest groups belonging to the Association for Computational Linguistics (ACL).
We organize our findings into two categories: ongoing shared tasks and a watch list.
- The ongoing list contains competitions that have already made task descriptions, data and schedules for 2014 publicly available.
  - International Workshop on Semantic Evaluation (SemEval)
  - CLEF eHealth Evaluation Lab
- The watch list contains competitions that haven't made content available but are relevant to our interests.
  - Conference on Natural Language Learning (CoNLL) Shared Tasks
  - Joint Conference on Lexical and Computational Semantics (*SEM) Shared Tasks
  - BioNLP
  - i2b2 Challenge

SemEval
Overview
SemEval, the International Workshop on Semantic Evaluation, is an ongoing series of evaluations of computational semantic analysis systems. It evolved from the SensEval (word sense evaluation) series.
SIGLEX, the Special Interest Group on the Lexicon of the Association for Computational Linguistics, is the umbrella organization for SemEval.
SemEval-2014 will be the 8th workshop on semantic evaluation. The workshop will be co-located with the 25th International Conference on Computational Linguistics (COLING) in Dublin, Ireland.

SemEval

Past workshops

Senseval-1 (1998)
Areas of study: Word Sense Disambiguation (WSD) - Lexical Sample WSD tasks
Languages of data evaluated: English, French, Italian

Senseval-2 (2001)
Areas of study: Word Sense Disambiguation (WSD) - Lexical Sample, All Words, Translation WSD tasks
Languages of data evaluated: Basque, Chinese, Czech, Danish, Dutch, English, Estonian, Italian, Japanese, Korean, Spanish, Swedish

Senseval-3 (2004)
Areas of study: Logic Form Transformation, Machine Translation (MT) Evaluation, Semantic Role Labeling, WSD
Languages of data evaluated: Basque, Catalan, Chinese, English, Italian, Romanian, Spanish

SemEval-2007
Areas of study: Cross-lingual, Frame Extraction, Information Extraction, Lexical Substitution, Lexical Sample, Metonymy, Semantic Annotation, Semantic Relations, Semantic Role Labeling, Sentiment Analysis, Time Expression, WSD
Languages of data evaluated: Arabic, Catalan, Chinese, English, Spanish, Turkish

SemEval-2010
Areas of study: Co-reference, Cross-lingual, Ellipsis, Information Extraction, Lexical Substitution, Metonymy, Noun Compounds, Parsing, Semantic Relations, Semantic Role Labeling, Sentiment Analysis, Textual Entailment, Time Expressions, WSD
Languages of data evaluated: Catalan, Chinese, Dutch, English, French, German, Italian, Japanese, Spanish

SemEval-2012
Areas of study: Common Sense Reasoning, Lexical Simplification, Relational Similarity, Spatial Role Labeling, Semantic Dependency Parsing, Semantic and Textual Similarity
Languages of data evaluated: Chinese, English

SemEval-2013
Areas of study: Temporal Annotation, Sentiment Analysis, Spatial Role Labeling, Noun Compounds, Phrasal Semantics, Textual Similarity, Response Analysis, Cross-lingual Textual Entailment, BioMedical Texts, Cross and Multi-lingual WSD, Word Sense Induction, and Lexical Sample
Languages of data evaluated: Catalan, French, German, English, Italian, Spanish

SemEval-2014

Task: Evaluation of compositional distributional semantic models (CDSMs) on full sentences
Description: Subtask A: predicting the degree of relatedness between two sentences. Subtask B: detecting the entailment relation holding between them.
Data: 10,000 English sentence pairs, each annotated for relatedness score in meaning and the entailment relation (entailment, contradiction, and neutral) between the two sentences.

Task: Grammar Induction for Spoken Dialogue Systems
Description: Creating clusters consisting of semantically similar fragments. For example, the two fragments "depart from <City>" and "fly out of <City>" are in the same cluster as they refer to the concept of departure city.
Data: Training data will cover two domains: air travel and tourism. The data will be available in two languages: Greek and English.

Task: Cross-level semantic similarity
Description: Evaluating similarity across different sizes of text: paragraph to sentence, sentence to phrase, phrase to word and word to sense.
Data: Information about data hasn't been released yet.

Task: Aspect Based Sentiment Analysis
Description: Subtask 1: Aspect term extraction. Subtask 2: Aspect term polarity. Subtask 3: Aspect category detection. Subtask 4: Aspect category polarity.
Data: Two domain-specific datasets (restaurant reviews and laptop reviews), consisting of over 6,500 sentences with fine-grained aspect-level human-authored annotations, will be provided.

Task: L2 writing assistant
Description: Build a translation assistance system that concerns the translation of fragments of one language (L1), i.e. words or phrases, in a second language (L2) context. For example, input (L1=French, L2=English): I rentre la
Data: The data set covers the following L1 and L2 pairs: English-German, English-Spanish, French-English and Dutch-English. The trial data contains 500 sentences for each language pair. Information about

SemEval-2014

Task: Spatial Robot Commands
Description: Parse spatial robot commands using data from an annotated corpus, collected from a simplified blocks world game (http://www.trainrobot.com).
Data: In trial data, each natural language command is annotated into a robot command. "Move the blue block on top of the grey block." is labeled as
(event: (action: move) (entity: (color: blue) (type: cube)) (destination: (spatial-relation: (relation: above) (entity: (color: gray) (type: cube)))))

Task: Analysis of Clinical Text
Description: Combine supervised methods for entity/acronym/abbreviation recognition and mapping to UMLS CUIs (Concept Unique Identifiers) with unsupervised discovery and sense induction of the entities/acronyms/abbreviations.
Data: Information about data hasn't been released yet.

Task: Broad-Coverage and Cross-Framework Semantic Dependency Parsing
Description: This task seeks to stimulate more generalized semantic dependency parsing and give a more direct analysis of "who did what to whom" from sentences.
Data: In trial data, 198 sentences from WSJ are annotated with the desired semantic representation.

Task: Sentiment Analysis for Twitter
Description: Subtask A - Contextual Polarity Disambiguation: Given a message containing a marked instance of a word or a phrase, determine whether that instance is positive, negative or neutral in that context. Subtask B - Message Polarity Classification: Given a message, decide whether the message is of positive, negative, or neutral sentiment.
Data:
training: 9,728 Twitter messages
development: 1,654 Twitter messages (can be used for training as well)
development-test A: 3,814 Twitter messages (CANNOT be used for training)
development-test B: 2,094 SMS messages (CANNOT be used for training)
The annotations and systems will use a
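
The bracketed robot-command annotation shown for the Spatial Robot Commands task is a nested tree, so it can be read with a small recursive parser. The following is a minimal sketch in Python, assuming only the "(label: value-or-children)" shape visible in the example; the function names `tokenize` and `parse` are illustrative and are not part of the task's tooling.

```python
import re

def tokenize(text):
    # Split into "(", ")", and bare words; a label keeps its trailing colon.
    return re.findall(r"\(|\)|[^\s()]+", text)

def parse(tokens, pos=0):
    """Parse one "(label: ...)" node starting at tokens[pos].

    Returns ((label, value), next_pos), where value is a string for a leaf
    such as (color: blue) and a list of child nodes otherwise.
    """
    assert tokens[pos] == "("
    label = tokens[pos + 1].rstrip(":")
    pos += 2
    children = []
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            children.append(child)
        else:
            children.append(tokens[pos])
            pos += 1
    is_leaf = len(children) == 1 and isinstance(children[0], str)
    value = children[0] if is_leaf else children
    return (label, value), pos + 1

annotation = (
    "(event: (action: move) (entity: (color: blue) (type: cube)) "
    "(destination: (spatial-relation: (relation: above) "
    "(entity: (color: gray) (type: cube)))))"
)
tree, _ = parse(tokenize(annotation))
print(tree)
```

Each node comes back as a (label, value) pair, which makes it straightforward to compare a system's predicted tree against the gold annotation node by node.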

SemEval-2014
Important Dates

Trial data ready Oct. 30, 2013

Training data ready Dec. 15, 2013
Test data ready Mar. 10, 2014
Evaluation end Mar. 30, 2014
Paper submission due Apr. 30, 2014
Paper reviews due May 30, 2014
Camera ready due Jun. 30, 2014
Workshop Aug. 23-30, 2014, Dublin, Ireland

CLEF eHealth Evaluation Lab

Overview

The CLEF Initiative (Conference and Labs of the Evaluation
Forum) is a self-organized body whose main mission is to
promote research, innovation, and development of
information access systems with an emphasis on
multilingual and multimodal information with various levels
of structure.
Since 2000, CLEF has aimed to stimulate
investigation and research in a wide range of key areas in
the information retrieval domain, and it has become well known in
the international IR community. The results were
traditionally presented and discussed at annual workshops
held in conjunction with the European Conference for Digital
Libraries (ECDL), now called Theory and Practice of Digital
Libraries (TPDL).

CLEF eHealth Evaluation Lab
Overview
In 2013, CLEF started the eHealth Evaluation Lab, a
shared task focused on natural language
processing (NLP) and information retrieval (IR) for
clinical care.
The CLEF eHealth Evaluation Lab 2013 had three
tasks:
Annotation of disorder mention spans in clinical reports
Annotation of acronym/abbreviation mention spans in
clinical reports
Information retrieval on medical-related web documents

CLEF eHealth 2014

Task: Visual-Interactive Search and Exploration of eHealth Data
Description: Subtask A: visualize the discharge summary together with the disorder standardization and shorthand expansion data in an effective and understandable way for laypeople. Subtask B: design a visual exploration approach that will provide an effective overview over a larger set of possibly relevant documents to meet the patient's information need.
Data: 6 de-identified discharge summaries and 50 real patient search queries generated from the discharge summaries.

Task: Information extraction from clinical text
Description: Develop annotated data, resources and methods that make clinical documents easier to understand from nurses' and patients' perspective. 10 different attributes (Negation Indicator, Subject Class, Uncertainty Indicator, Course Class, Severity Class, Conditional Class, Generic Class, Body Location, DocTime Class, and Temporal Expression) should be captured from clinical text and classified into certain value slots.
Data: A set of de-identified clinical reports is provided by the MIMIC II database. A training set of 300 reports and their disease/disorder mention templates with filled attribute:value slots will be provided. A test set of 200 reports and their disease/disorder mention templates with default-filled attribute:value slots will be provided for the Task 2 challenge one week before the run submission deadline.

Task: User-centered health information retrieval
Description: Subtask A: monolingual information retrieval task: retrieve the relevant medical documents for the user queries. Subtask B: multilingual information retrieval task: German, French and Czech.
Data: A set of medical-related documents in four languages (English, German, French and Czech) is provided by the Khresmoi project (approximately 1 million medical documents for each language). 5 training queries and 50 test queries are provided.

CLEF eHealth 2014

Important Dates
CLEF 2014 Lab registration opens: Nov. 2013
Task data release begins: Nov. 15, 2013
Participant submission deadline (final submission to be evaluated): May 1, 2014
Results released: Jun. 1, 2014
Participant working notes (i.e., extended abstracts and reports) submission deadline: Jun. 15, 2014
CLEF eHealth lab session at CLEF 2014 in Sheffield, UK: Sept. 15-18, 2014

CoNLL
Overview
CoNLL, the Conference on Natural Language Learning, is a yearly
meeting of the Special Interest Group on Natural Language Learning
(SIGNLL) of the Association for Computational Linguistics, held
since 1997.
Since 1999, CoNLL has included a shared task in which training and
test data are provided by the organizers, which allows participating
systems to be evaluated and compared in a systematic way.
Descriptions of the systems and evaluations of their performance
are presented both at the conference and in the proceedings.
The last CoNLL was held in August 2013 in Sofia, Bulgaria.
Information about CoNLL 2014 and its shared task will be released
next month.

CoNLL
Recent shared tasks from CoNLL

Year: 2013
Task: Grammatical Error Correction
Data: National University of Singapore Corpus of Learner English (NUCLE)
Language: English

Year: 2012
Task: Modeling Multilingual Unrestricted Coreference in OntoNotes
Data: OntoNotes dataset from Linguistic Data Consortium
Language: Arabic, Chinese, English

Year: 2011
Task: Modeling Unrestricted Coreference in OntoNotes
Data: OntoNotes dataset from Linguistic Data Consortium
Language: English

Year: 2010
Task: Subtask A: Learning to detect sentences containing uncertainty. Subtask B: Learning to resolve the in-sentence scope of hedge cues.
Data: A: biological abstracts and full articles from the BioScope (biomedical domain) corpus. B: paragraphs from Wikipedia possibly containing weasel information.
Language: English

Year: 2009
Task: Syntactic and Semantic Dependencies in Multiple Languages
Data: Data with gold standard annotation of syntactic dependency, type of dependency, frame, role set and sense in multiple languages
Language: English, Catalan, Chinese, Czech, German, Japanese and Spanish

*SEM
Overview
The Joint Conference on Lexical and Computational Semantics
(*SEM), held since 2012, is organized by the Association for
Computational Linguistics (ACL) Special Interest Group on the
Lexicon (SIGLEX) and Special Interest Group on Computational
Semantics (SIGSEM).
The main goal of *SEM is to provide a stable forum for
researchers working on different aspects of semantic processing.
Every *SEM conference includes a shared task in which training
and test data are provided by the organizers, allowing
participating systems to be evaluated and compared in a
systematic way. Information about the *SEM 2014
shared task will be released in December or early January 2014.

*SEM
*SEM 2012 shared task:
Description: Resolving the scope and the focus of
negation
Data: Stories by Conan Doyle, and WSJ PropBank Data
(about 8,000 sentences in total). All occurrences of
negation, their scope and focus are annotated.

*SEM 2013 shared task:

Description: Create a unified framework for the
evaluation of semantic textual similarity modules and
characterize their impact on NLP applications.
The data covers 5 areas: paraphrase sentence pairs
(MSRpar), sentence pairs from video descriptions
(MSRvid), MT evaluation sentence pairs (MTnews and
MTeuroparl) and gloss pairs (OnWN).

BioNLP
Overview
BioNLP shared tasks are organized by the ACL's
Special Interest Group for biomedical natural
language processing.
BioNLP 2013 was the twelfth workshop on
biomedical natural language processing; the workshop is held in
conjunction with the annual ACL or NAACL meeting.
BioNLP shared tasks are a biennial event held with
the BioNLP workshop since 2009. The next
event will be held in 2015.

BioNLP Past Shared Tasks

Year: 2013 (data released Oct. 2012; end date Apr. 2013)
1. Genia Event Extraction for NFkB knowledge base construction. Data: NFKB knowledge base
2. Cancer Genetics. Data: PubMed literature
3. Pathway Curation. Data: PubMed abstracts
4. Corpus Annotation with Gene Regulation Ontology. Data: PubMed literature
5. Bacteria Biotopes. Data: webpage documents with general information about bacteria species
7. Gene Regulation Network in Bacteria. Data: PubMed abstracts

Year: 2011 (data released Dec. 2010; end date Apr. 2011)
1. GENIA. Data: PubMed abstracts
2. Epigenetics and Post-translational Modifications. Data: PubMed abstracts
3. Infectious Diseases. Data: PubMed abstracts
4. Bacteria Biotopes. Data: PubMed abstracts
5. Bacteria Interactions. Data: PubMed abstracts
6. Co-reference. Data: PubMed abstracts
7. Gene/Protein Entity Relations. Data: PubMed abstracts
8. Gene renaming. Data: PubMed abstracts

Year: 2009 (data released Dec. 15, 2008; end date Mar. 30, 2009)
1. Core event extraction (identify events concerning the given proteins). Data: PubMed abstracts
2. Event enrichment. Data: PubMed abstracts

i2b2 Challenges
Informatics for Integrating Biology and the Bedside
(i2b2) is an NIH-funded National Center for Biomedical
Computing (NCBC).
The i2b2 center organizes data challenges to motivate the
development of scalable computational frameworks to
address the bottleneck limiting the translation of
genomic findings and hypotheses in model systems
relevant to human health.
i2b2 challenge workshops are held in conjunction with the
Annual Meeting of the American Medical Informatics
Association.

Previous i2b2 Challenges

Year | Task | Data | Release Date | End Date
2012 | Temporal relation extraction | EHR | Jun. 2012 | Sept. 2012
2011 | Co-reference resolution | EHR | Jun. 2011 | Sept. 2011
2010 | Relation extraction on medical problems | Discharge summaries | Apr. 2010 | Sept. 2010
2009 | Medication extraction | Narrative patient records | Jun. 2009 | Sept. 2009
2008 | Recognizing obesity and comorbidities | Discharge summaries | Mar. 2008 | Sept. 2008
2006 | De-identified discharge summaries | Discharge summaries | Jun. 2006 | Sept. 2006

APPLYING TEXT MINING IN HEALTH

SOCIAL MEDIA RESEARCH:
AN EXAMPLE

Extracting Adverse Drug Events from

Health Social Forums

Online patient forums can provide valuable supplementary
information on drug effectiveness and side effects.
Those forums cover a large and diverse population and contain data
directly from patients.
Patient forum ADE reports can serve as an economical alternative to
expensive and time-consuming patient-oriented drug safety data
collection projects.
They can help to generate new clinical hypotheses, cross-validate
the adverse drug events detected from other data sources, and
conduct comparison studies.

Post ID | Post Content | Contain ADE? | Report source
9043 | I had horrible chest pain [Event] under Actos [Treatment]. | ADE | Patient
12200 | From what you have said, it seems that Lantus [Treatment] has had some negative side effects related to depression [Event] and mood swings [Event]. | ADE | Hearsay
25139 | I never experienced fatigue [Event] when using Zocor [Treatment]. | Negated ADE | Patient
34188 | When taking Zocor [Treatment], I had headaches [Event] and bruising [Event]. | ADE | Patient
63828 | Another study of people with multiple risk factors for stroke [Event] found that Lipitor [Treatment] reduced the risk of stroke [Event] by 26% compared to those taking a placebo, the company said. | Drug Indication | Diabetes research

Test Bed
Discussion about disease monitoring and medical products
Discussion about disease and medical problems

Forum Name | Number of Posts | Number of Topics | Number of Member Profiles | Time Span | Total Number of Sentences
American Diabetes Association | 184,874 | 26,084 | 6,544 | 2009.2-2012.11 | 1,348,364
Diabetes Forums | 568,684 | 45,830 | 12,075 | 2002.2-2012.11 | 3,303,804
Diabetes Forum | 67,444 | 6,474 | 3,007 | 2007.2-2012.11 | 422,355

Extracting Adverse Drug Events from

Health Social Forums
Challenges
Topics in patient social media come from various sources, including
news and research, hearsay (stories of other people) and
patients' own experience. Redundant and noisy information often
masks patient-experienced ADEs.
Currently, extracting adverse event and drug relations from patient
comments yields low precision because of confusion with drug
indications (legitimate medical conditions a drug is used for)
and negated ADEs (contradiction or denial of experiencing ADEs)
in sentences.

Solutions
Develop a relation extractor for recognizing and extracting
adverse drug event relations.
Develop a text classifier to extract adverse drug event reports
based on patient experience.

Extracting Adverse Drug Event from

Health Social Forums

Patient forum data collection: collect patient forum data through a web crawler
Data preprocessing: remove noisy text, including URLs, duplicated punctuation, etc.,
and separate each post into individual sentences
Medical entity extraction: identify treatments and adverse events discussed in
the forum
Adverse drug event extraction: identify drug-event pairs indicating an adverse
drug event based on the results of medical entity extraction
Report source classification: classify the source of each reported event as either
patient experience or hearsay
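
The data preprocessing step above can be sketched in a few lines of Python. This is a minimal illustration, assuming simple regular expressions for URLs, repeated punctuation and sentence boundaries; the patterns and the function name `preprocess` are assumptions made for this example, not the rules used in the study.

```python
import re

# Illustrative patterns (assumptions, not the study's actual rules).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
REPEAT_PUNCT_RE = re.compile(r"([!?.,;:])\1+")
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")

def preprocess(post):
    """Return the cleaned sentences of one forum post."""
    text = URL_RE.sub(" ", post)              # drop URLs
    text = REPEAT_PUNCT_RE.sub(r"\1", text)   # collapse "!!!" to "!"
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    return [s for s in SENT_SPLIT_RE.split(text) if s]

sentences = preprocess(
    "I had headaches on Zocor!!! See http://example.com for more.  Anyone else??"
)
print(sentences)
```

A production pipeline would more likely rely on a trained sentence splitter, since forum text often lacks reliable punctuation.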

Medical Entity Extraction

Initialize the medical entity extraction with MetaMap
to match terms related to drugs and ADEs in forum
discussion.

Filter out the terms extracted by MetaMap that never
appear in FAERS reports.

Query the Consumer Health Vocabulary for consumer-preferred
terms of the entities extracted by MetaMap and look up those
consumer terms in the discussions.

MetaMap is a Java API that extracts
medical terms in UMLS. The figure
below shows sample output of
MetaMap.

FAERS is the FDA's knowledge base, which
contains adverse drug event reports
filed by consumers, doctors and drug
companies.
The Consumer Health Vocabulary is a lexicon
for mapping consumer-preferred terms
to terms in standard biomedical
ontologies such as UMLS.
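
The filtering and lookup steps described above can be illustrated with plain sets and dictionaries. The sketch below is hypothetical throughout: `candidate_terms` stands in for MetaMap output, `faers_terms` for terms seen in FAERS reports, and `chv_synonyms` for Consumer Health Vocabulary entries; none of these are real API calls, and substring matching is a simplification.

```python
def extract_entities(sentence, candidate_terms, faers_terms, chv_synonyms):
    """Return the canonical terms mentioned in a sentence.

    candidate_terms: terms proposed by the initial matcher (MetaMap stand-in)
    faers_terms:     terms that appear in FAERS reports (toy set)
    chv_synonyms:    canonical term -> consumer-preferred variants (toy dict)
    """
    text = sentence.lower()
    # Keep only candidates that also appear in FAERS reports.
    kept = {term for term in candidate_terms if term in faers_terms}
    found = set()
    for term in kept:
        # Look for the term itself or any consumer-preferred variant.
        variants = {term} | set(chv_synonyms.get(term, ()))
        if any(variant in text for variant in variants):
            found.add(term)
    return found

# Toy data, invented for illustration only.
candidates = {"hypoglycemia", "fatigue", "forum"}
faers = {"hypoglycemia", "fatigue"}
chv = {"hypoglycemia": ["low blood sugar"]}

entities = extract_entities(
    "Lantus gave me low blood sugar at night.", candidates, faers, chv
)
print(entities)
```

Here the consumer phrase "low blood sugar" is mapped back to the canonical term, which is the point of adding the consumer vocabulary on top of the UMLS-based matcher.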

Adverse Drug Event

Extraction

Kernel based statistical learning

Feature generation
Generate representations of the relation
instances

Syntactic and semantic classes

mapping
Categorize lexical features into syntactic and
semantic classes to reduce the feature
sparsity

Shortest dependency path kernel

Compute the similarity score between two
relation instances

Semantic filtering
Drug indications from FAERS
Incorporate medical domain knowledge
for differentiating drug indication from
adverse events

NegEX
Incorporate linguistic knowledge to
identify negated adverse drug events.

Semantic templates
Form filtering templates using the
knowledge from FAERS and NegEX.

Rule based classification

Adverse Drug Event

Extraction
Feature generation

We utilized the Stanford Parser (http://nlp.stanford.edu/software/stanforddependencies.shtml) for dependency parsing.

The figure above shows the dependency tree of a sentence. In this sentence,
"hypoglycemia" is an adverse event and "Lantus" is a diabetes treatment.
Grammatical relations between words are illustrated in the figure. For instance,
"cause" and "hypoglycemia" have a relation "dobj", as "hypoglycemia" is the direct
object of "cause". In this relation, "cause" is the governor and "hypoglycemia" is the
dependent.
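
Given governor/dependent pairs like the one just described, the shortest dependency path between an event and a treatment can be found with a breadth-first search over the undirected dependency graph. The sketch below is an illustration only: the dobj relation comes from the text, while the other relations (nsubj, prep_of, det) and the use of bare words as node identifiers are assumptions; a real parse would use token positions.

```python
from collections import deque

def shortest_dependency_path(dependencies, source, target):
    """Breadth-first search for the shortest path between two words.

    dependencies: (governor, relation, dependent) triples.
    "->" marks a step from a dependent up to its governor,
    "<-" marks a step from a governor down to a dependent.
    """
    neighbors = {}
    for governor, _relation, dependent in dependencies:
        neighbors.setdefault(dependent, []).append((governor, "->"))
        neighbors.setdefault(governor, []).append((dependent, "<-"))
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for word, arrow in neighbors.get(path[-1], []):
            if word not in seen:
                seen.add(word)
                queue.append(path + [arrow, word])
    return None  # the two words are not connected

# Hypothetical parse; only dobj(cause, hypoglycemia) is taken from the slide.
dependencies = [
    ("cause", "nsubj", "action"),
    ("cause", "dobj", "hypoglycemia"),
    ("action", "prep_of", "Lantus"),
    ("action", "det", "the"),
]
path = shortest_dependency_path(dependencies, "hypoglycemia", "Lantus")
print(path)
```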

Adverse Drug Event

Extraction
Syntactic and Semantic Classes Mapping

To reduce data sparsity and increase the robustness of our method,
we expand the shortest dependency path by categorizing words on the
path into syntactic and semantic classes with varying degrees of
generality.

Word classes include part-of-speech (POS) tags and generalized POS tags.
POS tags are extracted with the Stanford CoreNLP package. We generalized
the POS tags following the Penn Treebank guidelines. Semantic
types (Event and Treatment) are also used for the two ends of the shortest
path.

Syntactic and Semantic Classes Mapping from dependency graph

The relation instance in the figure above is represented as a sequence of features
X=[x1,x2,x3,x4,x5,x6,x7],
where x1={Hypoglycemia, NN, Noun, Event}, x2={->}, x3={cause, VB, Verb}, x4={<-},
x5={action, NN, Noun}, x6={<-}, x7={Lantus, NN, Noun, Treatment}.
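
The mapping from a dependency path to the feature sequence X can be sketched as follows. This is a minimal illustration assuming the POS tags and semantic types are already available as dictionaries; the `GENERAL_POS` table covers only a small subset of Penn Treebank tags, and the function name `map_classes` is made up for this example.

```python
# Generalized POS classes for a few Penn Treebank tags (illustrative subset).
GENERAL_POS = {
    "NN": "Noun", "NNS": "Noun", "NNP": "Noun", "NNPS": "Noun",
    "VB": "Verb", "VBD": "Verb", "VBG": "Verb",
    "VBN": "Verb", "VBP": "Verb", "VBZ": "Verb",
    "JJ": "Adjective", "JJR": "Adjective", "JJS": "Adjective",
}

def map_classes(path, pos_tags, semantic_types):
    """Turn a path of words and arrows into a list of word-class sets."""
    features = []
    for item in path:
        if item in ("->", "<-"):
            features.append({item})  # direction arrows are kept as-is
            continue
        pos = pos_tags[item]
        classes = {item, pos, GENERAL_POS.get(pos, pos)}
        if item in semantic_types:   # only the two path ends carry a type
            classes.add(semantic_types[item])
        features.append(classes)
    return features

path = ["Hypoglycemia", "->", "cause", "<-", "action", "<-", "Lantus"]
pos_tags = {"Hypoglycemia": "NN", "cause": "VB", "action": "NN", "Lantus": "NN"}
semantic_types = {"Hypoglycemia": "Event", "Lantus": "Treatment"}

X = map_classes(path, pos_tags, semantic_types)
print(X)
```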

Adverse Drug Event

Extraction
Shortest Dependency Path Kernel function

If x = x_1 x_2 ... x_m and y = y_1 y_2 ... y_n are two relation examples,
where x_i denotes the set of word classes corresponding to
position i, the kernel function is computed as in the equation
below (Bunescu et al. 2005):

$$K(x, y) = \begin{cases} 0, & m \neq n \\ \prod_{i=1}^{n} C(x_i, y_i), & m = n \end{cases}$$

where $C(x_i, y_i) = |x_i \cap y_i|$
is the number of common word classes between x_i and y_i.

Relation instance x=[{Hypoglycemia, NN, Noun, Event}, {->}, {cause, VB,
Verb}, {<-}, {action, NN, Noun}, {<-}, {Lantus, NN, Noun, Treatment}].
Relation instance y=[{depression, NN, Noun, Event}, {->}, {indicate, VBP,
Verb}, {<-}, {effect, NN, Noun}, {<-}, {Lantus, NNP, Noun, Treatment}].

K(x,y) can be computed as the product of the number of common
features of x_i and y_i at each position i:
K(x,y)=3*1*1*1*2*1*3=18.
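
The worked example above can be checked with a direct implementation of the kernel. The sketch below follows the definition in Bunescu and Mooney (2005), in which paths of different lengths receive a kernel value of 0; the function name `sdp_kernel` is chosen for this example.

```python
def sdp_kernel(x, y):
    """Shortest dependency path kernel over two sequences of word-class sets."""
    if len(x) != len(y):
        return 0  # paths of different length share no common structure
    value = 1
    for x_i, y_i in zip(x, y):
        value *= len(x_i & y_i)  # number of common word classes at position i
    return value

x = [{"Hypoglycemia", "NN", "Noun", "Event"}, {"->"}, {"cause", "VB", "Verb"},
     {"<-"}, {"action", "NN", "Noun"}, {"<-"},
     {"Lantus", "NN", "Noun", "Treatment"}]
y = [{"depression", "NN", "Noun", "Event"}, {"->"}, {"indicate", "VBP", "Verb"},
     {"<-"}, {"effect", "NN", "Noun"}, {"<-"},
     {"Lantus", "NNP", "Noun", "Treatment"}]

print(sdp_kernel(x, y))  # 3*1*1*1*2*1*3 = 18
```

Because the kernel is a product, a single position with no shared class drives the whole value to 0, which is why the generalized classes (Noun, Verb, Event, Treatment) matter for reducing sparsity.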

Adverse Drug Event

Extraction
SVM Classification
Many SVM software tools have been
developed and commercialized.
Among them, SVM-light package and LIBSVM are
two of the most widely used tools. Both are free
of charge and can be downloaded from the
Internet.
SVM-light is available at http://svmlight.joachims.org/
LIBSVM can be found at
http://www.csie.ntu.edu.tw/~cjlin/libsvm/

Adverse Drug Event

Extraction
SVM-light

Adverse Drug Event

Extraction
ALGORITHM. STATISTICAL LEARNING FOR ADVERSE DRUG EVENT
EXTRACTION
Input: all the relation instances R(drug, event), each containing a drug and
a medical event.
Output: whether each instance has a related pair of drug and event.
Procedure:
1. For each relation instance R(drug, event):
     Generate the dependency tree T of R(drug, event)
     Features = Shortest Dependency Path Extraction (T, R)
     Features = Syntactic and Semantic Classes Mapping (Features)
2. Separate the relation instances into a training set and a test set
3. Train an SVM classifier C with the shortest dependency path kernel function
   on the training set
4. Use the SVM classifier C to classify instances in the test set into two classes:
   R(drug, event) = True and R(drug, event) = False.

Adverse Drug Event

Extraction
ALGORITHM. SEMANTIC FILTERING ALGORITHM
Input: a relation instance i with a pair of related drug and
medical events, R(drug, event).
Output: the relation type.
If drug exists in FAERS:
    Get indication list for drug;
    For indication in indication list:
        If event = indication:
            Return R(drug, event) = Drug Indication;
For rule in NegEX:
    If relation instance i matches rule:
        Return R(drug, event) = Negated Adverse Drug Event;
Return R(drug, event) = Adverse Drug Event;
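
The semantic filtering procedure above translates almost line by line into Python. In the sketch below, the `indications` dictionary is a toy stand-in for the indication lists derived from FAERS, and the negation check (a cue word appearing before the event) is a deliberately simplified substitute for the NegEx rules; both are assumptions made for illustration.

```python
import re

def semantic_filter(drug, event, sentence, indications, negation_cues):
    """Label a drug-event pair following the filtering steps above."""
    drug, event, text = drug.lower(), event.lower(), sentence.lower()
    # Step 1: the event is a known indication of the drug.
    if event in indications.get(drug, set()):
        return "Drug Indication"
    # Step 2: a negation cue precedes the event (simplified stand-in for NegEx).
    for cue in negation_cues:
        pattern = r"\b" + re.escape(cue) + r"\b.*\b" + re.escape(event) + r"\b"
        if re.search(pattern, text):
            return "Negated Adverse Drug Event"
    # Step 3: everything else is kept as an adverse drug event.
    return "Adverse Drug Event"

# Toy knowledge, invented for illustration only.
indications = {"lipitor": {"stroke"}}
negation_cues = ["never", "no", "not", "without"]

labels = [
    semantic_filter("Zocor", "fatigue",
                    "I never experienced fatigue when using Zocor.",
                    indications, negation_cues),
    semantic_filter("Zocor", "headaches",
                    "When taking Zocor, I had headaches and bruising.",
                    indications, negation_cues),
    semantic_filter("Lipitor", "stroke",
                    "Lipitor reduced the risk of stroke by 26%.",
                    indications, negation_cues),
]
print(labels)
```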

Report Source Classification

To classify the report source of adverse
drug events, we developed a feature-based
classification model, building on prior studies, to distinguish patient reports
from hearsay.
We adopted bag-of-words (BOW) features and Transductive
Support Vector Machines in SVM-light for
classification.
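
As an illustration of how bag-of-words features are typically handed to SVM-light, the sketch below writes sentences in SVM-light's sparse input format, where each line is a target followed by ascending "feature:value" pairs and a target of 0 marks an unlabeled example for transductive learning. The tokenization, the vocabulary handling and the function name `to_svmlight` are assumptions for this example, not the study's feature pipeline.

```python
def to_svmlight(tokens, label, vocabulary):
    """Format one example as an SVM-light input line.

    label: +1 (patient report), -1 (hearsay) or 0 (unlabeled example).
    vocabulary: token -> feature id, extended in place; ids start at 1.
    """
    counts = {}
    for token in tokens:
        index = vocabulary.setdefault(token, len(vocabulary) + 1)
        counts[index] = counts.get(index, 0) + 1
    # SVM-light expects feature ids in increasing order.
    features = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    target = f"{label:+d}" if label else "0"
    return f"{target} {features}"

vocabulary = {}
labeled = to_svmlight("i had headaches i think".split(), 1, vocabulary)
unlabeled = to_svmlight("my friend had headaches".split(), 0, vocabulary)
print(labeled)
print(unlabeled)
```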

Evaluation on Medical Entity

Extraction
[Figure: Results of Medical Entity Extraction (precision, recall and F-measure)]

The performance of our system (F-measure) surpasses the best
performance in prior studies (F-measure 73.9%), which was achieved
by applying UMLS and MedEffect to extract adverse events from
DailyStrength (Leaman et al., 2010). There may be several reasons
why our approach outperforms prior work.

Combining multiple lexicons improves precision.

DailyStrength is a general health social website where users may have a more
diverse health vocabulary and show more linguistic creativity, so extracting
medical named entities from it could be more difficult than from our data source.

Evaluation on Adverse Drug Event

Extraction
[Figure: Results of Adverse Drug Event Extraction (precision, recall and F-measure by method for American Diabetes Association, Diabetes Forums and Diabetes Forum)]

Compared to the co-occurrence based approach (CO), statistical learning
(SL) increased precision from around 40% to
above 60%, while recall dropped from 100% to around 60%. The F-measure
of SL is better than that of the CO method.
Semantic filtering (SF) further improved the precision of extraction
from 60% to about 80% by filtering out drug indications and negated
ADEs.
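
The trade-off between precision and recall described above can be made concrete with the F-measure, assuming the balanced form (the harmonic mean of precision and recall). The numbers below are the rounded values quoted in the text, not the exact figures from the chart.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded values from the text: CO has ~40% precision at 100% recall,
# SL has ~60% precision at ~60% recall.
co = f_measure(0.40, 1.00)
sl = f_measure(0.60, 0.60)
print(round(co, 3), round(sl, 3))
```

Even with perfect recall, the low precision of the co-occurrence baseline keeps its F-measure below that of the statistical learning approach.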

Evaluation on Report Source

Classification
[Figure: Results of Report Source Classification (precision, recall and F-measure)]

Without report source classification (RSC), the performance of extraction is
heavily affected by noise in the discussion.
The precision ranged from 51% to 62% without RSC.
Overall performance (F-measure) ranged from 68% to 76%.

After report source classification, the precision and F-measure improved significantly.
The precision increased from 51% up to 84%.
The overall performance (F-measure) increased from 68% to above 80%.

Contrast of Our Proposed Framework to Co-occurrence Based Approach

Forum | Total Relation Instances | Adverse Drug Events | Patient Reported ADEs
American Diabetes Association | 2972 (100%) | 1069 (35.97%) | 652 (21.94%)
Diabetes Forums | 3652 (100%) | 1387 (37.98%) | 721 (19.74%)
Diabetes Forum | 1072 (100%) | 421 (39.27%) | 194 (18.10%)

There are a large number of false adverse drug events which
couldn't be filtered out by the co-occurrence based approach.
Based on our approach, only 35% to 40% of all the relation
instances contain adverse drug events.
Among them, about 50% come from patient reports.

References

*SEM: http://ixa2.si.ehu.es/starsem/
CoNLL: http://ifarm.nl/signll/conll/
SemEval: http://alt.qcri.org/semeval2014/
CLEF eHealth: http://clefehealth2014.dcu.ie/home
BioNLP: http://2013.bionlp-st.org/
i2b2: https://www.i2b2.org/
Benton A., Ungar L., Hill S., Hennessy S., Mao J., Chung A., & Holmes J. H. (2011). Identifying potential adverse effects using the web: A new approach to medical hypothesis generation. Journal of Biomedical Informatics, 44(6), pp. 989-996.
Bian J., Topaloglu U., & Yu F. (2012). Towards large-scale Twitter mining for drug-related adverse events. In: Proceedings of the 2012 ACM International Workshop on Smart Health and Wellbeing, pp. 25-32.
Bunescu R. C., & Mooney R. J. (2005). A Shortest Path Dependency Kernel for Relation Extraction. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 724-731.
Chee B. W., Berlin R., & Schatz B. (2011). Predicting adverse drug events from personal health messages. In: AMIA Annual Symposium Proceedings, Vol. 2011, pp. 217-226.
Culotta A., & Sorensen J. (2004). Dependency tree kernels for relation extraction. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pp. 423-429.
Leaman R., Wojtulewicz L., Sullivan R., Skariah A., Yang J., & Gonzalez G. (2010). Towards Internet-Age Pharmacovigilance: Extracting Adverse Drug Reactions from User Posts to Health-Related Social Networks. In: Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, ACL, pp. 117-125.
Liu X., & Chen H. (2013). AZDrugMiner: an information extraction system for mining patient-reported adverse drug events in online patient forums. In: Smart Health, Springer Berlin Heidelberg, pp. 134-150.
Yang C. C., Yang H., Jiang L., & Zhang M. (2012). Social media mining for drug safety signal detection. In: Proceedings of the 2012 International Workshop on Smart Health and Wellbeing, ACM, pp. 33-40.
Zelenko D., Aone C., & Richardella A. (2003). Kernel methods for relation extraction. Journal of Machine Learning Research, 3, pp. 1083-1106.

Building Recommendation System Using Movielens Data
No ratings yet
Building Recommendation System Using Movielens Data
6 pages
Text Mining Techniques Applications and Issues2
No ratings yet
Text Mining Techniques Applications and Issues2
5 pages
DataMining Lecture 1
No ratings yet
DataMining Lecture 1
35 pages
7 More Steps To Mastering Machine Learning With Python - Page1
No ratings yet
7 More Steps To Mastering Machine Learning With Python - Page1
8 pages
Software Engineering and Project Management - Unit 4
No ratings yet
Software Engineering and Project Management - Unit 4
14 pages
Data Science Training in Hyderabad
No ratings yet
Data Science Training in Hyderabad
7 pages
Unit II Requirements Elicitation
No ratings yet
Unit II Requirements Elicitation
23 pages
CS 8520: Artificial Intelligence: Knowledge Representation
No ratings yet
CS 8520: Artificial Intelligence: Knowledge Representation
30 pages
Worksheet 2
No ratings yet
Worksheet 2
1 page
Artificial Intelligence and Autonomous Vehicles
No ratings yet
Artificial Intelligence and Autonomous Vehicles
6 pages
DL HB Needs Analysis Questionnaire
No ratings yet
DL HB Needs Analysis Questionnaire
3 pages
Rubric For The Random Acts of Kindness To Your Community
100% (1)
Rubric For The Random Acts of Kindness To Your Community
2 pages
Educ Action Research Powerpoint
No ratings yet
Educ Action Research Powerpoint
17 pages
Speech Recognition Full Report
No ratings yet
Speech Recognition Full Report
11 pages
05alab5 Audition PDF
No ratings yet
05alab5 Audition PDF
23 pages
Tugas Eyl
No ratings yet
Tugas Eyl
18 pages
Habitats Research Brochure
No ratings yet
Habitats Research Brochure
5 pages
L F A C A T: AB Our Using Udition
No ratings yet
L F A C A T: AB Our Using Udition
14 pages
L F A C A T: AB Our Using Udacity
No ratings yet
L F A C A T: AB Our Using Udacity
18 pages
06lab6 PDF
No ratings yet
06lab6 PDF
21 pages
Hardware Software Mindmap
No ratings yet
Hardware Software Mindmap
1 page
11lab11 PDF
No ratings yet
11lab11 PDF
11 pages
An Intersemiotic Translation of Poetry Is Useless PDF
No ratings yet
An Intersemiotic Translation of Poetry Is Useless PDF
15 pages
Mira Nuri Santika 1811040002 6C Semantic Pragmatic
No ratings yet
Mira Nuri Santika 1811040002 6C Semantic Pragmatic
11 pages
CTC Lesson Plan Checklist Feedback Form
No ratings yet
CTC Lesson Plan Checklist Feedback Form
3 pages
07alab7 ProTools PDF
No ratings yet
07alab7 ProTools PDF
29 pages
08lab8 PDF
No ratings yet
08lab8 PDF
20 pages
07lab7 PDF
No ratings yet
07lab7 PDF
6 pages
News Writing
No ratings yet
News Writing
3 pages
Shift in Translation
No ratings yet
Shift in Translation
4 pages
07alab7 Audition PDF
No ratings yet
07alab7 Audition PDF
30 pages
04lab4 PDF
No ratings yet
04lab4 PDF
22 pages
Types of Computers Mindmap
No ratings yet
Types of Computers Mindmap
1 page
08alab8 Audacity PDF
No ratings yet
08alab8 Audacity PDF
13 pages
10alab10 Audition PDF
No ratings yet
10alab10 Audition PDF
11 pages
10alab10 Audition PDF
No ratings yet
10alab10 Audition PDF
11 pages
06alab6 ProTools PDF
No ratings yet
06alab6 ProTools PDF
6 pages
Hardware Out Put Devices
No ratings yet
Hardware Out Put Devices
1 page
09alab9 Audacity PDF
No ratings yet
09alab9 Audacity PDF
5 pages
06alab6 Audition PDF
No ratings yet
06alab6 Audition PDF
5 pages
Future Progressive Tense
No ratings yet
Future Progressive Tense
11 pages
Table of Specs
No ratings yet
Table of Specs
8 pages
05lab5 PDF
No ratings yet
05lab5 PDF
2 pages
11alab11 Audacity
No ratings yet
11alab11 Audacity
1 page
El Concepto de Sinificado Desde El Analisis Del Comportamiento y Otras Perspectivas
No ratings yet
El Concepto de Sinificado Desde El Analisis Del Comportamiento y Otras Perspectivas
14 pages
50 Phrasal Verbs For Work and Business
No ratings yet
50 Phrasal Verbs For Work and Business
4 pages
Lesson Plan 2
No ratings yet
Lesson Plan 2
5 pages
Phrasal Verbs1 - Lesson
No ratings yet
Phrasal Verbs1 - Lesson
4 pages
Peer Observation Sheet
No ratings yet
Peer Observation Sheet
3 pages