
KEA: Practical Automatic Keyphrase Extraction

Ian H. Witten,* Gordon W. Paynter,* Eibe Frank,* Carl Gutwin† and Craig G. Nevill-Manning‡

* Dept of Computer Science, University of Waikato, Hamilton, New Zealand. {ihw,gwp,eibe}@cs.waikato.ac.nz
† Dept of Computer Science, University of Saskatchewan, Saskatoon, Canada. gutwin@cs.usask.ca
‡ Dept of Computer Science, Rutgers University, Piscataway, New Jersey. nevill@cs.rutgers.edu

Keyphrases provide semantic metadata that summarize and characterize documents. Kea is an algorithm for automatically extracting keyphrases from text. We use a large test corpus to evaluate its effectiveness in terms of how many author-assigned keyphrases are correctly identified. The system is simple, robust, and publicly available. Kea identifies candidate keyphrases using lexical methods, calculates feature values for each candidate, and uses a machine-learning algorithm to predict which candidates are good keyphrases. The machine-learning scheme first builds a prediction model using training documents with known keyphrases, and then uses the model to find keyphrases in new documents.

Keyphrases are useful because they briefly summarize a document's content. As large document collections such as digital libraries become widespread, the value of such summary information increases. Keywords and keyphrases are particularly useful because they can be interpreted individually and independently of each other. They can be used in information retrieval systems as descriptions of the documents returned by a query, as the basis for search indexes, as a way of browsing a collection, and as a document clustering technique (e.g. [2], [3], [4]).

Keyphrases are usually chosen manually. In many academic contexts, authors assign keyphrases to documents they have written. Professional indexers often choose phrases from a "controlled vocabulary" that is predefined for the domain at hand. However, the great majority of documents come without keyphrases, and assigning them manually is a tedious process that requires knowledge of the subject matter. Automatic extraction techniques are potentially of great benefit.

THE KEA ALGORITHM
Kea is an algorithm for automatically extracting keyphrases from text. The algorithm has two stages:
1. Training: create a model for identifying keyphrases, using training documents where the author's keyphrases are known.
2. Extraction: choose keyphrases from a new document, using the above model.
Both stages choose a set of candidate phrases from their input documents, and then calculate the values of certain attributes, or features, for each candidate.

Candidate phrases. Kea chooses candidate phrases in three steps: it first cleans the input text, then identifies candidates, and finally stems and case-folds the phrases. After splitting the text into words and sentences, Kea considers all the word subsequences in each sentence and determines which of these are suitable candidate phrases. All words are then case-folded and stemmed.
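For concreteness, a minimal Python sketch of candidate selection follows. The tokenization rules, the three-word length limit, the stopword filter and the simple_stem helper are illustrative assumptions; Kea's actual cleaning, suitability tests and stemmer are not specified in this summary.

import re

# Assumed stopword list; Kea's actual suitability tests are richer.
STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "to", "for", "is"}

def simple_stem(word: str) -> str:
    """Crude suffix-stripping stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def candidate_phrases(text: str, max_len: int = 3) -> set[str]:
    """Return case-folded, stemmed word sequences that could be keyphrases."""
    candidates = set()
    # Clean the input and split it into sentences, then words.
    for sentence in re.split(r"[.!?;:\n]+", text):
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        # Consider every contiguous subsequence of up to max_len words.
        for i in range(len(words)):
            for j in range(i + 1, min(i + max_len, len(words)) + 1):
                phrase = words[i:j]
                # Stand-in suitability test: drop phrases that start or
                # end with a stopword.
                if phrase[0].lower() in STOPWORDS or phrase[-1].lower() in STOPWORDS:
                    continue
                # Case-fold and stem every word in the phrase.
                candidates.add(" ".join(simple_stem(w.lower()) for w in phrase))
    return candidates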
Feature calculation. Two features are calculated for each candidate phrase and used in training and extraction. They are TF×IDF, a measure of a phrase's frequency in a document compared to its rarity in general use; and first occurrence, which is the distance into the document of the phrase's first appearance.
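These two features can be written down concretely. The sketch below assumes one standard formulation of TF×IDF and normalizes first occurrence by document length; the summary above gives only informal definitions, so the exact formulas and the +1 smoothing are assumptions.

import math

def tfidf(phrase_count: int, doc_length: int,
          docs_with_phrase: int, global_corpus_size: int) -> float:
    """TF×IDF: frequency of the phrase in this document, discounted by how
    common it is in a global corpus (assumed formulation)."""
    tf = phrase_count / doc_length
    # +1 keeps the score finite for phrases unseen in the global corpus.
    idf = -math.log2((docs_with_phrase + 1) / (global_corpus_size + 1))
    return tf * idf

def first_occurrence(index_of_first_word: int, doc_length: int) -> float:
    """Distance into the document of the phrase's first appearance,
    expressed here as a fraction of the document's length (assumed
    normalization)."""
    return index_of_first_word / doc_length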
Training. The training stage uses a set of training documents for which the author's keyphrases are known. For each training document, candidate phrases are identified and their feature values are calculated as described above. Each candidate is then labelled according to whether it is one of the author's keyphrases, and the scheme generates a model that predicts this class from the two feature values.

We have experimented with a number of different machine learning schemes; Kea uses the Naïve Bayes technique because it is simple and yields good results [1]. This scheme learns two sets of numeric weights from the discretized feature values, one set applying to positive ("is a keyphrase") examples and the other to negative ("is not a keyphrase") instances.
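As an illustration of what those weights might look like, the sketch below discretizes the two features into bins and estimates smoothed per-bin probabilities for the positive and negative classes. The bin boundaries, the Laplace smoothing and the data layout are assumptions; Kea's actual discretization is not described in this summary.

from collections import defaultdict

def discretize(value: float, boundaries: list[float]) -> int:
    """Map a feature value to a bin index, given assumed bin boundaries."""
    for i, b in enumerate(boundaries):
        if value < b:
            return i
    return len(boundaries)

def train_naive_bayes(examples, tfidf_bins, dist_bins):
    """examples: (tfidf, first_occurrence, is_keyphrase) triples.
    Returns class priors and per-bin weights for each class."""
    priors = defaultdict(int)
    tfidf_counts = defaultdict(lambda: defaultdict(int))
    dist_counts = defaultdict(lambda: defaultdict(int))
    for t, d, is_key in examples:
        cls = "yes" if is_key else "no"
        priors[cls] += 1
        tfidf_counts[cls][discretize(t, tfidf_bins)] += 1
        dist_counts[cls][discretize(d, dist_bins)] += 1

    def weights(counts, n_bins):
        # Laplace-smoothed P(bin | class) for each class and bin.
        return {cls: [(counts[cls][b] + 1) / (priors[cls] + n_bins)
                      for b in range(n_bins)]
                for cls in ("yes", "no")}

    total = priors["yes"] + priors["no"]
    return {"prior": {c: priors[c] / total for c in ("yes", "no")},
            "tfidf": weights(tfidf_counts, len(tfidf_bins) + 1),
            "dist": weights(dist_counts, len(dist_bins) + 1)}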
Extracting keyphrases from new documents. To select keyphrases from a new document, Kea extracts candidate phrases, determines feature values, and then applies the model built during training. The model determines the overall probability that each candidate is a keyphrase, and then a post-processing operation selects the best set of keyphrases.
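A sketch of this last step, under the same assumptions as the training sketch above: the "yes" score is normalized against the "no" score to give a probability, candidates are ranked, and a simple de-duplication rule stands in for the post-processing, which is not spelled out in this summary.

def score_candidate(model, t_bin: int, d_bin: int) -> float:
    """Overall probability that a candidate is a keyphrase under the model."""
    yes = (model["prior"]["yes"] * model["tfidf"]["yes"][t_bin]
           * model["dist"]["yes"][d_bin])
    no = (model["prior"]["no"] * model["tfidf"]["no"][t_bin]
          * model["dist"]["no"][d_bin])
    return yes / (yes + no)

def select_keyphrases(scored, how_many: int = 5) -> list[str]:
    """scored: (phrase, probability) pairs.  Rank by probability and drop a
    phrase that is contained in a higher-ranked one (assumed rule)."""
    chosen: list[str] = []
    for phrase, _ in sorted(scored, key=lambda s: s[1], reverse=True):
        if any(phrase in kept or kept in phrase for kept in chosen):
            continue
        chosen.append(phrase)
        if len(chosen) == how_many:
            break
    return chosen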

Protocols for secure, atomic transaction execution in electronic commerce
  Author: anonymity; atomicity; auction; electronic commerce; privacy; real-time; security; transaction
  Kea: atomicity; auction; customer; electronic commerce; intruder; merchant; protocol; security; third party; transaction

Neural multigrid for gauge theories and other disordered systems
  Author: disordered systems; gauge fields; multigrid; neural multigrid; neural networks; smooth
  Kea: disordered; gauge; gauge fields; interpolation kernels; length scale; multigrid

Proof nets, garbage, and computations
  Author: cut-elimination; linear logic; proof nets; sharing graphs; typed lambda-calculus
  Kea: cut; cut elimination; garbage; proof net; weakening

Figure 1. Examples of author- and Kea-assigned keyphrases.

EVALUATION
We carried out an empirical evaluation of Kea using documents from the New Zealand Digital Library [5]. Our goals were to assess Kea's overall effectiveness, and also to investigate the effects of varying several parameters in the extraction process. We measured keyphrase quality by counting the number of matches between Kea's output and the keyphrases that were originally chosen by the document's author. Figure 1 lists the Kea- and author-assigned keyphrases for three computer science technical reports. Phrases that appear in both lists are italicized.
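A sketch of this measurement, assuming that a "match" is an exact comparison of normalized phrases (the precise matching rule is not spelled out here):

def count_matches(kea_phrases: list[str], author_phrases: list[str]) -> int:
    """Count Kea phrases that also appear in the author's list.  The
    comparison here only case-folds and collapses whitespace; a faithful
    measurement would also stem, as in the candidate-phrase step."""
    author_set = {" ".join(p.lower().split()) for p in author_phrases}
    return sum(1 for p in kea_phrases
               if " ".join(p.lower().split()) in author_set)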
Our results show that Kea can on average match between one and two of the five keyphrases chosen by the author in this collection [1]. We consider this to be good performance. Although Kea finds fewer than half of the author's phrases, it must choose from many thousands of candidates; also, it is highly unlikely that even another human would select the same set of phrases as the original author.
Furthermore, we have determined that the following are reasonable minimums on source data for using Kea effectively:
• Kea works well with a training set of as few as 20 documents, meaning that human indexers need only assign manual keyphrases to a small number of documents in order to extract good keyphrases from the rest of the collection.
• Kea works best on the full text of documents, rather than on titles and abstracts alone.
• The global document corpus (used to calculate TF×IDF scores) can contain as few as 10 documents, and does not need to contain documents that are similar to the collection being processed.
CONCLUSION
Kea is an algorithm for automatically extracting keyphrases from text. Our goal is to provide useful metadata where none existed before. By extracting reasonable summaries from text documents, we give a valuable tool to designers and users of digital libraries.

In the future, we plan to expand the evaluation of the algorithm. In particular, we have been working with the assumption that using author-specified keyphrases to evaluate the scheme is a reasonable indicator of finding 'good' keyphrases. However, in the near future we will test that assumption by evaluating Kea's output using human expert judges, and by comparing Kea to other document summarization methods.

Kea is available from the New Zealand Digital Library project (http://www.nzdl.org/).

REFERENCES
[1] Frank, E., Paynter, G.W., Witten, I.H., Gutwin, C. and Nevill-Manning, C.G. (1999) Domain-Specific Keyphrase Extraction. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers, San Francisco, CA.
[2] Gutwin, C., Paynter, G., Witten, I.H., Nevill-Manning, C.G. and Frank, E. (1999) Improving Browsing in Digital Libraries with Keyphrase Indexes. J. Decision Support Systems. To appear.
[3] Jones, S. and Paynter, G.W. (1999) Topic-Based Browsing Within a Digital Library Using Keyphrases. In Proc. DL '99.
[4] Witten, I.H. (1999) Browsing around a digital library. In Proc. Australasian Computer Science Conference, Auckland, New Zealand, 1–14.
[5] Witten, I.H., McNab, R., Jones, S., Apperley, M., Bainbridge, D. and Cunningham, S.J. (1999) Managing Complexity in a Distributed Digital Library. IEEE Computer, 32, 2, 74–79.
