Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond

Ramesh Nallapati (IBM Watson) nallapati@us.ibm.com
Bowen Zhou (IBM Watson) zhou@us.ibm.com
Cicero dos Santos (IBM Watson) cicerons@us.ibm.com
Çağlar Gülçehre (Université de Montréal) gulcehrc@iro.umontreal.ca
Bing Xiang (IBM Watson) bingxia@us.ibm.com

Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning (CoNLL), pages 280-290, Berlin, Germany, August 7-12, 2016. © 2016 Association for Computational Linguistics. DOI: 10.18653/v1/K16-1028

Abstract

In this work, we model abstractive text summarization using Attentional Encoder-Decoder Recurrent Neural Networks, and show that they achieve state-of-the-art performance on two different corpora. We propose several novel models that address critical problems in summarization that are not adequately modeled by the basic architecture, such as modeling key-words, capturing the hierarchy of sentence-to-word structure, and emitting words that are rare or unseen at training time. Our work shows that many of our proposed models contribute to further improvement in performance. We also propose a new dataset consisting of multi-sentence summaries, and establish performance benchmarks for further research.

1 Introduction

Abstractive text summarization is the task of generating a headline or a short summary consisting of a few sentences that captures the salient ideas of an article or a passage. We use the adjective 'abstractive' to denote a summary that is not a mere selection of a few existing passages or sentences extracted from the source, but a compressed paraphrasing of the main contents of the document, potentially using vocabulary unseen in the source document.

This task can also be naturally cast as mapping an input sequence of words in a source document to a target sequence of words called the summary. In the recent past, deep-learning based models that map an input sequence into another output sequence, called sequence-to-sequence models, have been successful in many problems such as machine translation (Bahdanau et al., 2014), speech recognition (Bahdanau et al., 2015) and video captioning (Venugopalan et al., 2015). In the framework of sequence-to-sequence models, a very relevant model to our task is the attentional Recurrent Neural Network (RNN) encoder-decoder model proposed in Bahdanau et al. (2014), which has produced state-of-the-art performance in machine translation (MT), which is also a natural language task.

Despite the similarities, abstractive summarization is a very different problem from MT. Unlike in MT, the target (summary) is typically very short and does not depend very much on the length of the source (document) in summarization. Additionally, a key challenge in summarization is to optimally compress the original document in a lossy manner such that the key concepts in the original document are preserved, whereas in MT, the translation is expected to be loss-less. In translation, there is a strong notion of almost one-to-one word-level alignment between source and target, but in summarization, it is less obvious.

We make the following main contributions in this work: (i) We apply the off-the-shelf attentional encoder-decoder RNN that was originally developed for machine translation to summarization, and show that it already outperforms state-of-the-art systems on two different English corpora. (ii) Motivated by concrete problems in summarization that are not sufficiently addressed by the machine translation based model, we propose novel models and show that they provide additional improvement in performance. (iii) We propose a new dataset for the task of abstractive summarization of a document into multiple sentences and establish benchmarks.

The rest of the paper is organized as follows. In Section 2, we describe each specific problem in abstractive summarization that we aim to solve, and present a novel model that addresses it. Section 3 contextualizes our models with respect to closely related work on the topic of abstractive text summarization. We present the results of our experiments on three different data sets in Section 4. We also present some qualitative analysis of the output from our models in Section 5 before concluding the paper with remarks on our future direction in Section 6.
2 Models

In this section, we first describe the basic encoder-decoder RNN that serves as our baseline and then propose several novel models for summarization, each addressing a specific weakness in the baseline.

2.1 Encoder-Decoder RNN with Attention and Large Vocabulary Trick

Our baseline model corresponds to the neural machine translation model used in Bahdanau et al. (2014). The encoder consists of a bidirectional GRU-RNN (Chung et al., 2014), while the decoder consists of a uni-directional GRU-RNN with the same hidden-state size as that of the encoder, an attention mechanism over the source hidden states, and a soft-max layer over the target vocabulary to generate words. In the interest of space, we refer the reader to the original paper for a detailed treatment of this model. In addition to the basic model, we also adapted the large vocabulary 'trick' (LVT) described in Jean et al. (2014) to the summarization problem. In our approach, the decoder vocabulary of each mini-batch is restricted to the words in the source documents of that batch. In addition, the most frequent words in the target dictionary are added until the vocabulary reaches a fixed size. The aim of this technique is to reduce the size of the soft-max layer of the decoder, which is the main computational bottleneck. The technique also speeds up convergence by focusing the modeling effort only on the words that are essential to a given example. It is particularly well suited to summarization, since a large proportion of the words in the summary come from the source document in any case.
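To make the LVT construction concrete, the following is a minimal Python sketch of building the per-mini-batch decoder vocabulary. It is not the authors' implementation; the function name, arguments and the max_size default are our own illustrative choices.

from typing import Dict, List

def lvt_decoder_vocab(batch_sources: List[List[str]],
                      target_words_by_freq: List[str],
                      max_size: int = 2000) -> Dict[str, int]:
    """Build the restricted per-batch decoder vocabulary (the LVT 'trick').

    batch_sources: tokenized source documents of the current mini-batch.
    target_words_by_freq: full target dictionary sorted by corpus frequency.
    max_size: fixed size of the per-batch soft-max layer.
    """
    vocab: List[str] = []
    seen = set()

    # 1. Every word appearing in the batch's source documents is allowed,
    #    since summary words are largely copied from the source.
    for doc in batch_sources:
        for w in doc:
            if w not in seen:
                seen.add(w)
                vocab.append(w)

    # 2. Top up with the most frequent target-side words until the fixed
    #    vocabulary size is reached.
    for w in target_words_by_freq:
        if len(vocab) >= max_size:
            break
        if w not in seen:
            seen.add(w)
            vocab.append(w)

    # The soft-max for this batch ranges only over these max_size words.
    return {w: i for i, w in enumerate(vocab[:max_size])}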
2.2 Capturing Keywords using Feature-rich Encoder

In summarization, one of the key challenges is to identify the key concepts and key entities in the document, around which the story revolves. In order to accomplish this goal, we may need to go beyond the word-embeddings-based representation of the input document and capture additional linguistic features such as parts-of-speech tags, named-entity tags, and TF and IDF statistics of the words. We therefore create additional look-up based embedding matrices for the vocabulary of each tag-type, similar to the embeddings for words. For continuous features such as TF and IDF, we convert them into categorical values by discretizing them into a fixed number of bins, and use one-hot representations to indicate the bin number they fall into. This allows us to map them into an embeddings matrix like any other tag-type. Finally, for each word in the source document, we simply look up its embeddings from all of its associated tags and concatenate them into a single long vector, as shown in Fig. 1. On the target side, we continue to use only word-based embeddings as the representation.

[Figure 1: Feature-rich encoder: We use one embedding vector each for POS and NER tags and discretized TF and IDF values, which are concatenated together with word-based embeddings as input to the encoder.]
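The look-up-and-concatenate step for one source word can be illustrated with the NumPy sketch below. The embedding tables, their sizes and the bin edges are hypothetical placeholders; only the overall scheme (one embedding matrix per tag type, discretized TF/IDF, concatenation into a single long input vector) follows the description above.

import numpy as np

rng = np.random.default_rng(0)

# One embedding table per tag type (all sizes here are illustrative).
word_emb = rng.normal(size=(50000, 100))   # word vocabulary
pos_emb  = rng.normal(size=(45, 10))       # POS tag set
ner_emb  = rng.normal(size=(10, 10))       # NER tag set
tf_emb   = rng.normal(size=(10, 5))        # 10 TF bins
idf_emb  = rng.normal(size=(10, 5))        # 10 IDF bins

tf_edges  = np.linspace(0.0, 1.0, 9)       # bin edges used to discretize TF
idf_edges = np.linspace(0.0, 10.0, 9)      # bin edges used to discretize IDF

def encoder_input(word_id, pos_id, ner_id, tf, idf):
    """Concatenate all look-ups for one word into a single long input vector."""
    tf_bin  = int(np.digitize(tf, tf_edges))    # continuous value -> bin index
    idf_bin = int(np.digitize(idf, idf_edges))
    return np.concatenate([word_emb[word_id], pos_emb[pos_id],
                           ner_emb[ner_id], tf_emb[tf_bin], idf_emb[idf_bin]])

x = encoder_input(word_id=1234, pos_id=7, ner_id=3, tf=0.02, idf=4.6)
print(x.shape)   # (130,): 100 + 10 + 10 + 5 + 5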
2.3 Modeling Rare/Unseen Words using Switching Generator-Pointer

Oftentimes in summarization, the keywords or named-entities in a test document that are central to the summary may actually be unseen or rare with respect to the training data. Since the vocabulary of the decoder is fixed at training time, it cannot emit these unseen words. A common way of handling these out-of-vocabulary (OOV) words is to emit an 'UNK' token as a placeholder, but this does not result in legible summaries. In summarization, an intuitive way to handle such OOV words is to simply point to their location in the source document instead. We model this notion using our novel switching decoder/pointer architecture, which is graphically represented in Figure 2. In this model, the decoder is equipped with a 'switch' that decides between using the generator or a pointer at every time-step. If the switch is turned on, the decoder produces a word from its target vocabulary in the normal fashion. However, if the switch is turned off, the decoder instead generates a pointer to one of the word-positions in the source. The word at the pointer-location is then copied into the summary. The switch is modeled as a sigmoid activation function over a linear layer based on the entire available context at each time-step, as shown below:

$$P(s_i = 1) = \sigma\big(v^s \cdot (W_h^s h_i + W_e^s E[o_{i-1}] + W_c^s c_i + b^s)\big),$$

where P(s_i = 1) is the probability of the switch turning on at the i-th time-step of the decoder, h_i is the hidden state, E[o_{i-1}] is the embedding vector of the emission from the previous time step, c_i is the attention-weighted context vector, and W_h^s, W_e^s, W_c^s, b^s and v^s are the switch parameters. We use the attention distribution over word positions in the document as the distribution to sample the pointer from:

$$P_i^a(j) \propto \exp\big(v^a \cdot (W_h^a h_{i-1} + W_e^a E[o_{i-1}] + W_c^a h_j^d + b^a)\big),$$
$$p_i = \arg\max_{j} P_i^a(j) \quad \text{for } j \in \{1, \ldots, N_d\}.$$

In the above equations, p_i is the pointer value at the i-th word-position in the summary, sampled from the attention distribution P_i^a over the document word-positions j in {1, ..., N_d}, where P_i^a(j) is the probability of the i-th time-step in the decoder pointing to the j-th position in the document, and h_j^d is the encoder's hidden state at position j.

At training time, we provide the model with explicit pointer information whenever the summary word does not exist in the target vocabulary. When the OOV word in the summary occurs in multiple document positions, we break the tie in favor of its first occurrence. At training time, we optimize the conditional log-likelihood shown below, with additional regularization penalties:

$$\log P(y|x) = \sum_i \Big( g_i \log\big\{P(y_i|y_{-i}, x)\, P(s_i)\big\} + (1 - g_i) \log\big\{P(p(i)|y_{-i}, x)\, (1 - P(s_i))\big\} \Big),$$

where y and x are the summary and document words respectively, and g_i is an indicator function that is set to 0 whenever the word at position i in the summary is OOV with respect to the decoder vocabulary. At test time, the model decides automatically at each time-step whether to generate or to point, based on the estimated switch probability P(s_i). We simply use the arg max of the posterior probability of generation or pointing to generate the best output at each time step.

The pointer mechanism may be more robust in handling rare words because it uses the encoder's hidden-state representation of rare words to decide which word from the document to point to. Since the hidden state depends on the entire context of the word, the model is able to accurately point to unseen words even though they do not appear in the target vocabulary.[1]

[1] Even when the word does not exist in the source vocabulary, the pointer model may still be able to identify the correct position of the word in the source, since it takes into account the contextual representation of the corresponding 'UNK' token encoded by the RNN. Once the position is known, the corresponding token from the source document can be displayed in the summary even when it is not part of the training vocabulary on either the source side or the target side.

[Figure 2: Switching generator/pointer model: When the switch shows 'G', the traditional generator consisting of the softmax layer is used to produce a word, and when it shows 'P', the pointer network is activated to copy the word from one of the source document positions. When the pointer is activated, the embedding from the source is used as input for the next time-step, as shown by the arrow from the encoder to the decoder at the bottom.]
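A minimal NumPy sketch of one test-time decoding step is given below. All weight matrices are hypothetical stand-ins for the trained switch, generator and pointer parameters, and the generator soft-max is shown in its simplest form; the intent is only to illustrate the arg-max decision between generating and pointing described above.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decode_step(h_i, h_prev, e_prev, c_i, enc_states, p):
    """One test-time step of the switching generator/pointer decoder.

    h_i: current decoder hidden state, h_prev: previous hidden state,
    e_prev: embedding of the previous emission, c_i: attention-weighted
    context vector, enc_states: (N_d, d) encoder hidden states h_j^d.
    p: dict of (hypothetical) parameters W_hs, W_es, W_cs, b_s, v_s,
       W_ha, W_ea, W_ca, b_a, v_a, W_out, b_out.
    """
    # Switch probability P(s_i = 1): generate from the target vocabulary.
    switch_logit = p['v_s'] @ (p['W_hs'] @ h_i + p['W_es'] @ e_prev
                               + p['W_cs'] @ c_i + p['b_s'])
    p_gen = 1.0 / (1.0 + np.exp(-switch_logit))

    if p_gen >= 0.5:
        # Arg max over {generate, point} reduces to thresholding at 0.5.
        # Generator path: soft-max over the (LVT-restricted) target vocabulary.
        word_id = int(np.argmax(softmax(p['W_out'] @ h_i + p['b_out'])))
        return 'generate', word_id

    # Pointer path: attention-like scores over source positions j = 1..N_d,
    # P_i^a(j) proportional to exp(v_a . (W_ha h_{i-1} + W_ea E[o_{i-1}] + W_ca h_j^d + b_a)).
    base = p['W_ha'] @ h_prev + p['W_ea'] @ e_prev + p['b_a']
    scores = np.array([p['v_a'] @ (base + p['W_ca'] @ h_d) for h_d in enc_states])
    return 'point', int(np.argmax(softmax(scores)))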
2.4 Capturing Hierarchical Document Structure with Hierarchical Attention

In datasets where the source document is very long, in addition to identifying the keywords in the document, it is also important to identify the key sentences from which the summary can be drawn. This model aims to capture this notion of two levels of importance using two bi-directional RNNs on the source side, one at the word level and the other at the sentence level. The attention mechanism operates at both levels simultaneously. The word-level attention is further re-weighted by the corresponding sentence-level attention and re-normalized as shown below:

$$P^a(j) = \frac{P_w^a(j)\, P_s^a(s(j))}{\sum_{k=1}^{N_d} P_w^a(k)\, P_s^a(s(k))},$$

where P_w^a(j) is the word-level attention weight at the j-th position of the source document, s(j) is the ID of the sentence at the j-th word position, P_s^a(l) is the sentence-level attention weight for the l-th sentence in the source, N_d is the number of words in the source document, and P^a(j) is the re-scaled attention at the j-th word position. The re-scaled attention is then used to compute the attention-weighted context vector that goes as input to the hidden state of the decoder. Further, we also concatenate additional positional embeddings to the hidden state of the sentence-level RNN to model the positional importance of sentences in the document. This architecture therefore models key sentences as well as keywords within those sentences jointly. A graphical representation of this model is displayed in Figure 3.

[Figure 3: Hierarchical encoder with hierarchical attention: the attention weights at the word level, represented by the dashed arrows, are re-scaled by the corresponding sentence-level attention weights, represented by the dotted arrows. The dashed boxes at the bottom of the top-layer RNN represent sentence-level positional embeddings concatenated to the corresponding hidden states.]

3 Related Work

A vast majority of past work in summarization has been extractive, which consists of identifying key sentences or passages in the source document and reproducing them as the summary (Neto et al., 2002; Erkan and Radev, 2004; Wong et al., 2008a; Filippova and Altun, 2013; Colmenares et al., 2015; Litvak and Last, 2008; K. Riedhammer and Hakkani-Tur, 2010; Ricardo Ribeiro, 2013).

Humans, on the other hand, tend to paraphrase the original story in their own words. As such, human summaries are abstractive in nature and seldom consist of reproductions of original sentences from the document. The task of abstractive summarization has been standardized using the DUC-2003 and DUC-2004 competitions.[2] The data for these tasks consists of news stories from various topics with multiple reference summaries per story generated by humans. The best performing system on the DUC-2004 task, called TOPIARY (Zajic et al., 2004), used a combination of linguistically motivated compression techniques and an unsupervised topic detection algorithm that appends keywords extracted from the article onto the compressed output. Some of the other notable work in the task of abstractive summarization includes using traditional phrase-table based machine translation approaches (Banko et al., 2000), compression using weighted tree-transformation rules (Cohn and Lapata, 2008) and quasi-synchronous grammar approaches (Woodsend et al., 2010).

With the emergence of deep learning as a viable alternative for many NLP tasks (Collobert et al., 2011), researchers have started considering this framework as an attractive, fully data-driven alternative to abstractive summarization. In Rush et al. (2015), the authors use convolutional models to encode the source, and a context-sensitive attentional feed-forward neural network to generate the summary, producing state-of-the-art results on the Gigaword and DUC datasets. In an extension to this work, Chopra et al. (2016) used a similar convolutional model for the encoder, but replaced the decoder with an RNN, producing further improvement in performance on both datasets. In another paper that is closely related to our work, Hu et al. (2015) introduce a large dataset for Chinese short text summarization. They show promising results on their Chinese dataset using an encoder-decoder RNN, but do not report experiments on English corpora. In another very recent work, Cheng and Lapata (2016) used an RNN based encoder-decoder for extractive summarization of documents.

Our work starts with the same framework as Hu et al. (2015), but we go beyond the standard architecture and propose novel models that address critical problems in summarization.

[2] http://duc.nist.gov/
We analyze the similarities and differences of our proposed models with related work on abstractive summarization below.

Feature-rich encoder (Sec. 2.2): Linguistic features such as POS tags and named-entities, as well as TF and IDF information, were used in many extractive approaches to summarization (Wong et al., 2008b), but to the best of our knowledge they are novel in the context of deep learning approaches for abstractive summarization.

Switching generator-pointer model (Sec. 2.3): This model combines extractive and abstractive approaches to summarization in a single end-to-end framework. Rush et al. (2015) also used a combination of extractive and abstractive approaches, but their extractive model is a separate log-linear classifier with handcrafted features. Pointer networks (Vinyals et al., 2015) have also been used earlier for the problem of rare words in the context of machine translation (Luong et al., 2015), but the novel addition of the switch in our model allows it to strike a balance between when to be faithful to the original source (e.g., for named entities and OOV words) and when it is allowed to be creative. We believe such a process arguably mimics how humans produce summaries. For a more detailed treatment of this model, and experiments on multiple tasks, please refer to the parallel work published by some of the authors of this work (Gulcehre et al., 2016).

Hierarchical attention model (Sec. 2.4): Previously proposed hierarchical encoder-decoder models use attention only at the sentence level (Li et al., 2015). The novelty of our approach lies in joint modeling of attention at both sentence and word levels, where the word-level attention is further influenced by sentence-level attention, thus capturing the notion of important sentences and important words within those sentences. Concatenation of positional embeddings with the hidden state at the sentence level is also new.

4 Experiments and Results

4.1 Gigaword Corpus

In this series of experiments[3], we used the annotated Gigaword corpus as described in Rush et al. (2015). We used the scripts made available by the authors of this work[4] to preprocess the data, which resulted in about 3.8M training examples. The script also produces about 400K validation and test examples, but we created a randomly sampled subset of 2000 examples each for validation and testing purposes, on which we report our performance. Further, we also acquired the exact test sample used in Rush et al. (2015) to make a precise comparison of our models with theirs. We also made small modifications to the script to extract not only the tokenized words, but also system-generated parts-of-speech and named-entity tags.

[3] We used Kyunghyun Cho's code (https://github.com/kyunghyuncho/dl4mt-material) as the starting point.
[4] https://github.com/facebook/NAMAS

Training: For all the models we discuss below, we used 200-dimensional word2vec vectors (Mikolov et al., 2013) trained on the same corpus to initialize the model embeddings, but we allowed them to be updated during training. The hidden state dimension of the encoder and decoder was fixed at 400 in all our experiments. When we used only the first sentence of the document as the source, as done in Rush et al. (2015), the encoder vocabulary size was 119,505 and that of the decoder stood at 68,885. We used Adadelta (Zeiler, 2012) for training, with an initial learning rate of 0.001. We used a batch size of 50 and randomly shuffled the training data at every epoch, while sorting every 10 batches according to their lengths to speed up training. We did not use any dropout or regularization, but applied gradient clipping. We used early stopping based on the validation set and used the best model on the validation set to report all test performance numbers. For all our models, we employ the large-vocabulary trick, where we restrict the decoder vocabulary size to 2,000[5], because it cuts down the training time per epoch by nearly three times, and helps this and all subsequent models converge in only 50%-75% of the epochs needed for the model based on the full vocabulary.

[5] Larger values improved performance only marginally, but at the cost of much slower training.

Decoding: At decode time, we used beam search of size 5 to generate the summary, and limited the size of the summary to a maximum of 30 words, since this is the maximum size we noticed in the sampled validation set. We found that the average system summary length from all our models (7.8 to 8.3 words) agrees very closely with that of the ground truth on the validation set (about 8.7 words), without any specific tuning.
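For reference, a generic beam search of the kind used at decode time can be sketched as follows. The step_fn interface is a hypothetical stand-in for one step of the attentional decoder, and no length normalization is applied; only the beam size and length limit mirror the settings above.

import numpy as np

def beam_search(step_fn, init_state, bos_id, eos_id, beam_size=5, max_len=30):
    """Generic beam search decoder.

    step_fn(state, last_token_id) -> (log_probs over vocab, new_state) is a
    stand-in for one step of the attentional decoder. Returns the best-scoring
    token sequence of at most max_len tokens (excluding the BOS token).
    """
    # Each beam is (cumulative log-prob, tokens so far, decoder state, finished?).
    beams = [(0.0, [bos_id], init_state, False)]
    for _ in range(max_len):
        candidates = []
        for score, toks, state, done in beams:
            if done:
                candidates.append((score, toks, state, True))
                continue
            log_probs, new_state = step_fn(state, toks[-1])
            for t in np.argsort(log_probs)[-beam_size:]:   # expand only the top tokens
                t = int(t)
                candidates.append((score + float(log_probs[t]),
                                   toks + [t], new_state, t == eos_id))
        # Keep the beam_size best partial hypotheses.
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_size]
        if all(done for _, _, _, done in beams):
            break
    return beams[0][1][1:]   # drop the BOS token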
Computational costs: We trained all our models on a single Tesla K40 GPU. Most models took about 10 hours per epoch on average, except the hierarchical attention model, which took 12 hours per epoch. All models typically converged within 15 epochs using our early stopping criterion based on the validation cost. The wall-clock training time until convergence therefore varies between 6-8 days depending on the model. Generating summaries at test time is reasonably fast, with a throughput of about 20 summaries per second on a single GPU, using a batch size of 1.

Evaluation metrics: In Rush et al. (2015), the authors used the full-length version of Rouge recall[6] to evaluate their systems on the Gigaword corpus[7]. However, full-length recall favors longer summaries, so it may not be fair to use this metric to compare two systems that differ in summary lengths. Full-length F1 solves this problem since it can penalize longer summaries. Therefore, we use full-length F1 scores from the 1, 2 and L variants of Rouge using the official script to evaluate our systems. However, in the interest of fair comparison with previous work, we also report full-length recall scores where necessary. In addition, we also report the percentage of tokens in the system summary that occur in the source (which we call 'src. copy rate' in Table 1).

[6] http://www.berouge.com/Pages/default.aspx
[7] Confirmed via personal communication with the first author of the paper.

We describe all our experiments and results on the Gigaword corpus below.

words-lvt2k-1sent: This is the baseline attentional encoder-decoder model with the large vocabulary trick. This model is trained only on the first sentence from the source document, as done in Rush et al. (2015).

words-lvt2k-2sent: This model is identical to the model above except for the fact that it is trained on the first two sentences from the source. On this corpus, adding the additional sentence in the source does seem to aid performance, as shown in Table 1. We also tried adding more sentences, but the performance dropped, which is probably because the latter sentences in this corpus are not pertinent to the summary.

words-lvt2k-2sent-hieratt: Since we used two sentences from the source document, we trained the hierarchical attention model proposed in Sec. 2.4. As shown in Table 1, this model improves performance compared to its flatter counterpart by learning the relative importance of the first two sentences automatically.

feats-lvt2k-2sent: Here, we still train on the first two sentences, but we exploit the parts-of-speech and named-entity tags in the annotated Gigaword corpus, as well as TF and IDF values, to augment the input embeddings on the source side as described in Sec. 2.2. In total, our embedding vector grew from the original 100 to 155, and produced incremental gains compared to its counterpart words-lvt2k-2sent, as shown in Table 1, demonstrating the utility of syntax-based features in this task.

feats-lvt2k-2sent-ptr: This is the switching generator/pointer model described in Sec. 2.3, but in addition, we also use feature-rich embeddings on the document side as in the above model. Our experiments indicate that the new model is able to achieve the best performance on our test set on all three Rouge variants, as shown in Table 1.

Comparison with state-of-the-art: Rush et al. (2015) reported recall-only numbers from the full-length version of Rouge, but the authors kindly provided us with their F1 numbers, as well as their test sample. We compared the performance of our model words-lvt2k-1sent with their best system on their sample, on both recall and F1, as displayed in Table 1. The reason we did not evaluate our best models here is that this test set consisted of only one sentence from the source document, and did not include NLP annotations, which are needed in our best models. The table shows that, despite this fact, our model outperforms the state-of-the-art model of Rush et al. (2015) on both recall and F1, with statistical significance. In addition, our models exhibit better abstractive ability, as shown by the src. copy rate metric in the last column of the table.

We believe the bidirectional RNN we used to model the source captures richer contextual information about every word than the bag-of-embeddings representation used by Rush et al. (2015) in their convolutional and attentional encoders, which might explain our superior performance. Further, explicit modeling of important information such as multiple source sentences, word-level linguistic features, using the switch mechanism to point to source words when needed, and hierarchical attention, solve specific problems in summarization, each boosting performance incrementally.
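To illustrate the point made under 'Evaluation metrics' above, namely that full-length recall rewards longer summaries while F1 penalizes them, here is a simplified unigram-overlap (Rouge-1-style) computation. The official Rouge script additionally applies stemming and other preprocessing, so this toy version is for intuition only.

from collections import Counter

def rouge1(candidate: str, reference: str):
    """Simplified Rouge-1: unigram-overlap recall, precision and F1."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())          # clipped unigram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if recall + precision == 0 else 2 * recall * precision / (recall + precision)
    return recall, precision, f1

ref = "norway protests russia barring fisheries research ships"
short = "norway protests russia over fisheries research"
long_ = "norway protests russia barring fisheries research ships in the barents sea after talks"

print(rouge1(short, ref))   # recall ~0.71, F1 ~0.77
print(rouge1(long_, ref))   # recall 1.00 (longer output is rewarded), but F1 drops to ~0.70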
# Model name Rouge-1 Rouge-2 Rouge-L Src. copy rate (%)
Full length F1 on our internal test set
1 words-lvt2k-1sent 34.97 17.17 32.70 75.85
2 words-lvt2k-2sent 35.73 17.38 33.25 79.54
3 words-lvt2k-2sent-hieratt 36.05 18.17 33.52 78.52
4 feats-lvt2k-2sent 35.90 17.57 33.38 78.92
5 feats-lvt2k-2sent-ptr *36.40 17.77 *33.71 78.70
Full length Recall on the test set used by (Rush et al., 2015)
6 ABS+ (Rush et al., 2015) 31.47 12.73 28.54 91.50
7 words-lvt2k-1sent *34.19 *16.29 *32.13 74.57
Full length F1 on the test set used by (Rush et al., 2015)
8 ABS+ (Rush et al., 2015) 29.78 11.89 26.97 91.50
9 words-lvt2k-1sent *32.67 *15.59 *30.64 74.57

Table 1: Performance comparison of various models. ’*’ indicates statistical significance of the corresponding model with
respect to the baseline model on its dataset as given by the 95% confidence interval in the official Rouge script. We report
statistical significance only for the best performing models. ’src. copy rate’ for the reference data on our validation sample is
45%. Please refer to Section 4 for explanation of notation.

4.2 DUC Corpus

The DUC corpus[8] comes in two parts: the 2003 corpus consisting of 624 document-summary pairs, and the 2004 corpus consisting of 500 pairs. Since these corpora are too small to train large neural networks on, Rush et al. (2015) trained their models on the Gigaword corpus, but combined them with an additional log-linear extractive summarization model with handcrafted features that is trained on the DUC 2003 corpus. They call the original neural attention model the ABS model, and the combined model ABS+. The latter model is the current state-of-the-art since it outperforms all previously published baselines, including non-neural-network based extractive and abstractive systems, as measured by the official DUC metric of limited-length recall. In these experiments, we use the same metric to evaluate our models too, but we omit reporting numbers from other systems in the interest of space.

[8] http://duc.nist.gov/duc2004/tasks.html

In our work, we simply run the model trained on the Gigaword corpus as it is, without tuning it on the DUC validation set. The only change we made to the decoder is to suppress the model from emitting the end-of-summary tag, and force it to emit exactly 30 words for every summary, since the official evaluation on this corpus is based on limited-length Rouge recall. On this corpus too, since we have only a single sentence from the source and no NLP annotations, we ran just the model words-lvt2k-1sent.

The performance of this model on the test set is compared with the ABS and ABS+ models, as well as TOPIARY, the top performing system on DUC-2004, in Table 2. We note that although our model consistently outperforms ABS+ on all three variants of Rouge, the differences are not statistically significant. However, when the comparison is made with the ABS model, which is really the true un-tuned counterpart of our model, the results are indeed statistically significant.

Model               Rouge-1  Rouge-2  Rouge-L
TOPIARY             25.12    6.46     20.12
ABS                 26.55    7.06     22.05
ABS+                28.18    8.49     23.81
words-lvt2k-1sent   28.35    9.46     24.59

Table 2: Evaluation of our models using the limited-length Rouge Recall on DUC validation and test sets. Our best model, although trained exclusively on the Gigaword corpus, consistently outperforms the ABS+ model, which is tuned on the DUC-2003 validation corpus in addition to being trained on the Gigaword corpus.

We would also like to bring the reader's attention to the concurrently published work of Chopra et al. (2016), where they also used an RNN based decoder for summary generation. While their numbers on the Gigaword corpus are slightly better than our best performance on all three Rouge F1 metrics, our performance is marginally higher on the DUC-2004 corpus on Rouge-2 and Rouge-L. We believe their work also confirms the effectiveness of RNN-based models for abstractive text summarization.
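As an illustration of the DUC decoding constraint described above (suppressing the end-of-summary tag and emitting exactly 30 words), here is a minimal greedy sketch. The actual system combines this constraint with beam search, and the step_fn interface is a hypothetical stand-in for one step of the decoder.

import numpy as np

def decode_fixed_length(step_fn, init_state, bos_id, eos_id, length=30):
    """Greedy decoding that suppresses the end-of-summary token and emits
    exactly `length` words, as required for limited-length Rouge recall."""
    state, tok, out = init_state, bos_id, []
    for _ in range(length):
        log_probs, state = step_fn(state, tok)
        log_probs = np.array(log_probs, dtype=float).copy()
        log_probs[eos_id] = -np.inf          # never emit the end-of-summary tag
        tok = int(np.argmax(log_probs))
        out.append(tok)
    return out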
4.3 CNN/Daily Mail Corpus

The existing abstractive text summarization corpora, including Gigaword and DUC, consist of only one sentence in each summary. In this section, we present a new corpus that comprises multi-sentence summaries. To produce this corpus, we modify an existing corpus that has been used for the task of passage-based question answering (Hermann et al., 2015). In that work, the authors used the human-generated abstractive summary bullets from news stories on the CNN and Daily Mail websites as questions (with one of the entities hidden), and stories as the corresponding passages from which the system is expected to answer the fill-in-the-blank question. The authors released the scripts that crawl, extract and generate pairs of passages and questions from these websites. With a simple modification of the script, we restored all the summary bullets of each story in the original order to obtain a multi-sentence summary, where each bullet is treated as a sentence. In all, this corpus has 286,817 training pairs, 13,368 validation pairs and 11,487 test pairs, as defined by their scripts. The source documents in the training set have 766 words spanning 29.74 sentences on average, while the summaries consist of 53 words and 3.72 sentences. The unique characteristics of this dataset, such as long documents and ordered multi-sentence summaries, present interesting challenges, and we hope they will attract future researchers to build and test novel models on it.

The dataset is released in two versions: one consisting of actual entity names, and the other in which entity occurrences are replaced with document-specific integer-ids beginning from 0. Since the vocabulary size is smaller in the anonymized version, we used it in all our experiments below. We limited the source vocabulary size to 150K and the target vocabulary to 60K, and the source and target lengths to at most 800 and 100 words respectively. We used 100-dimensional word2vec embeddings trained on this dataset as input, and we fixed the model hidden state size at 200. We also created explicit pointers in the training data by matching only the anonymized entity-ids between source and target, on similar lines as we did for the OOV words in the Gigaword corpus.

Computational costs: We used a single Tesla K40 GPU to train our models on this dataset as well. While the flat models (words-lvt2k and words-lvt2k-ptr) took under 5 hours per epoch, the hierarchical attention model was very expensive, consuming nearly 12.5 hours per epoch. Convergence of all models is also slower on this dataset compared to Gigaword, taking nearly 35 epochs for all models. Thus, the wall-clock time for training until convergence is about 7 days for the flat models, but nearly 18 days for the hierarchical attention model. Decoding is also slower, with a throughput of 2 examples per second for the flat models and 1.5 examples per second for the hierarchical attention model, when run on a single GPU with a batch size of 1.

Evaluation: We evaluated our models using the full-length Rouge F1 metric that we employed for the Gigaword corpus, but with one notable difference: in both system and gold summaries, we considered each highlight to be a separate sentence.[9]

[9] This was done by modifying the pre-processing script such that each highlight gets its own "<a>" tag in the xml file that goes as input to the evaluation script.

Results: Results from the three models we ran on this corpus are displayed in Table 3. Although this dataset is smaller and more complex than the Gigaword corpus, it is interesting to note that the Rouge numbers are in the same range. However, our switching pointer/generator model as well as the hierarchical attention model described in Sec. 2.4 fail to outperform the baseline attentional decoder, indicating that further research and experimentation needs to be done on this dataset. These results, although preliminary, should serve as a good baseline for future researchers to compare their models against.

Model                Rouge-1  Rouge-2  Rouge-L
words-lvt2k          32.49    11.84    29.47
words-lvt2k-ptr      32.12    11.72    29.16
words-lvt2k-hieratt  31.78    11.56    28.73

Table 3: Performance of various models on the CNN/Daily Mail test set using the full-length Rouge-F1 metric. Bold-faced numbers indicate the best performing system.
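The entity-id pointer construction mentioned in the training setup above (matching anonymized entity ids between source and target, breaking ties in favor of the first source occurrence) can be sketched as follows. This is an illustration rather than the authors' preprocessing code; the "@entity" token prefix follows the convention of the anonymized CNN/Daily Mail release.

from typing import List, Optional

def pointer_targets(source_tokens: List[str], summary_tokens: List[str],
                    entity_prefix: str = "@entity") -> List[Optional[int]]:
    """For each summary token, return a pointer into the source when the token
    is an anonymized entity id that also appears in the source, else None.

    Ties are broken in favor of the first occurrence in the source, mirroring
    the treatment of OOV words on the Gigaword corpus.
    """
    first_pos = {}
    for j, tok in enumerate(source_tokens):
        if tok.startswith(entity_prefix) and tok not in first_pos:
            first_pos[tok] = j

    return [first_pos.get(tok) if tok.startswith(entity_prefix) else None
            for tok in summary_tokens]

src = "@entity0 met @entity3 in @entity7 on tuesday".split()
tgt = "@entity0 holds talks with @entity3".split()
print(pointer_targets(src, tgt))   # [0, None, None, None, 2]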
5 Qualitative Analysis

Table 4 presents a few high quality and poor quality outputs on the validation set from feats-lvt2k-2sent, one of our best performing models. Even when the model differs from the target summary, its summaries tend to be very meaningful and relevant, a phenomenon not captured by word/phrase matching evaluation metrics such as Rouge. On the other hand, the model sometimes 'misinterprets' the semantics of the text and generates a summary with a comical interpretation, as shown in the poor quality examples in the table. Clearly, capturing the 'meaning' of complex sentences remains a weakness of these models.

Good quality summary output
S: a man charged with the murder last year of a british backpacker confessed to the slaying on the night he was charged with her killing , according to police evidence presented at a court hearing tuesday . ian douglas previte , ## , is charged with murdering caroline stuttle , ## , of yorkshire , england
T: man charged with british backpacker 's death confessed to crime police officer claims
O: man charged with murdering british backpacker confessed to murder

S: following are the leading scorers in the english premier league after saturday 's matches : ## - alan shearer -lrb- newcastle united -rrb- , james beattie .
T: leading scorers in english premier league
O: english premier league leading scorers

S: volume of transactions at the nigerian stock exchange has continued its decline since last week , a nse official said thursday . the latest statistics showed that a total of ##.### million shares valued at ###.### million naira -lrb- about #.### million us dollars -rrb- were traded on wednesday in , deals .
T: transactions dip at nigerian stock exchange
O: transactions at nigerian stock exchange down

Poor quality summary output
S: broccoli and broccoli sprouts contain a chemical that kills the bacteria responsible for most stomach cancer , say researchers , confirming the dietary advice that moms have been handing out for years . in laboratory tests the chemical , <unk> , killed helicobacter pylori , a bacteria that causes stomach ulcers and often fatal stomach cancers .
T: for release at #### <unk> mom was right broccoli is good for you say cancer researchers
O: broccoli sprouts contain deadly bacteria

S: norway delivered a diplomatic protest to russia on monday after three norwegian fisheries research expeditions were barred from russian waters . the norwegian research ships were to continue an annual program of charting fish resources shared by the two countries in the barents sea region .
T: norway protests russia barring fisheries research ships
O: norway grants diplomatic protest to russia

S: j.p. morgan chase 's ability to recover from a slew of recent losses rests largely in the hands of two men , who are both looking to restore tarnished reputations and may be considered for the top job someday . geoffrey <unk> , now the co-head of j.p. morgan 's investment bank , left goldman , sachs & co. more than a decade ago after executives say he lost out in a bid to lead that firm .
T: # executives to lead j.p. morgan chase on road to recovery
O: j.p. morgan chase may be considered for top job

Table 4: Examples of generated summaries from our best model on the validation set of the Gigaword corpus. S: source document, T: target summary, O: system output. Although we display an equal number of good quality and poor quality summaries in the table, the good ones are far more prevalent than the poor ones.

Our next example output, presented in Figure 4, displays sample output from the switching generator/pointer model on the Gigaword corpus.

[Figure 4: Sample output from the switching generator/pointer network. An arrow indicates that a pointer to the source position was used to generate the corresponding summary word.]

It is apparent from the examples that the model learns to use pointers very accurately, not only for named entities, but also for multi-word phrases. Despite its accuracy, the performance improvement of the overall model is not significant. We believe the impact of this model may be more pronounced in other settings with a heavier-tailed distribution of rare words. We intend to carry out more experiments with this model in the future.

On the CNN/Daily Mail data, although our models are able to produce good quality multi-sentence summaries, we notice that the same sentence or phrase often gets repeated in the summary. We believe models that incorporate intra-attention, such as Cheng et al. (2016), can fix this problem by encouraging the model to 'remember' the words it has already produced in the past.

6 Conclusion

In this work, we apply the attentional encoder-decoder model to the task of abstractive summarization with very promising results, outperforming state-of-the-art results significantly on two different datasets. Each of our proposed novel models addresses a specific problem in abstractive summarization, yielding further improvement in performance. We also propose a new dataset for multi-sentence summarization and establish benchmark numbers on it. As part of our future work, we plan to focus our efforts on this data and build more robust models for summaries consisting of multiple sentences.
References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Dzmitry Bahdanau, Jan Chorowski, Dmitriy Serdyuk, Philemon Brakel, and Yoshua Bengio. 2015. End-to-end attention-based large vocabulary speech recognition. CoRR, abs/1508.04395.

Michele Banko, Vibhu O. Mittal, and Michael J. Witbrock. 2000. Headline generation based on statistical translation. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics, 22:318-325.

Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long short-term memory-networks for machine reading. CoRR, abs/1601.06733.

Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In HLT-NAACL.

Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.

Trevor Cohn and Mirella Lapata. 2008. Sentence compression beyond word deletion. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, pages 137-144.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel P. Kuksa. 2011. Natural language processing (almost) from scratch. CoRR, abs/1103.0398.

Carlos A. Colmenares, Marina Litvak, Amin Mantrach, and Fabrizio Silvestri. 2015. Heads: Headline generation as sequence prediction using an abstract feature-rich space. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 133-142.

G. Erkan and D. R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457-479.

Katja Filippova and Yasemin Altun. 2013. Overcoming the lack of parallel data in sentence compression. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1481-1491.

Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.

Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. CoRR, abs/1506.03340.

Baotian Hu, Qingcai Chen, and Fangze Zhu. 2015. LCSTS: A large scale Chinese short text summarization dataset. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1967-1972, Lisbon, Portugal, September. Association for Computational Linguistics.

Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2014. On using very large target vocabulary for neural machine translation. CoRR, abs/1412.2007.

K. Riedhammer, B. Favre, and D. Hakkani-Tur. 2010. Long story short – global unsupervised models for keyphrase based meeting summarization. In Speech Communication, pages 801-815.

Jiwei Li, Minh-Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. CoRR, abs/1506.01057.

M. Litvak and M. Last. 2008. Graph-based keyword extraction for single-document summarization. In Coling 2008, pages 17-24.

Thang Luong, Ilya Sutskever, Quoc V. Le, Oriol Vinyals, and Wojciech Zaremba. 2015. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, pages 11-19.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546.

Joel Larocca Neto, Alex Alves Freitas, and Celso A. A. Kaestner. 2002. Automatic text summarization using a machine learning approach. In Proceedings of the 16th Brazilian Symposium on Artificial Intelligence: Advances in Artificial Intelligence, pages 205-215.

Ricardo Ribeiro, Luís Marujo, David Martins de Matos, João P. Neto, Anatole Gershman, and Jaime Carbonell. 2013. Self reinforcement for important passage retrieval. In 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 845-848.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. CoRR, abs/1509.00685.

Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond J. Mooney, Trevor Darrell, and Kate Saenko. 2015. Sequence to sequence - video to text. CoRR, abs/1505.00487.

O. Vinyals, M. Fortunato, and N. Jaitly. 2015. Pointer networks. ArXiv e-prints, June.

Kam-Fai Wong, Mingli Wu, and Wenjie Li. 2008a. Extractive summarization using supervised and semi-supervised learning. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, pages 985-992.

Kam-Fai Wong, Mingli Wu, and Wenjie Li. 2008b. Extractive summarization using supervised and semi-supervised learning. In Proceedings of the 22nd Annual Meeting of the Association for Computational Linguistics, pages 985-992.

Kristian Woodsend, Yansong Feng, and Mirella Lapata. 2010. Title generation with quasi-synchronous grammar. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 513-523, Stroudsburg, PA, USA. Association for Computational Linguistics.

David Zajic, Bonnie J. Dorr, and Richard Schwartz. 2004. BBN/UMD at DUC-2004: Topiary. In Proceedings of the North American Chapter of the Association for Computational Linguistics Workshop on Document Understanding, pages 112-119.

Matthew D. Zeiler. 2012. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701.
