TAC Submission With LML-2
Abstract—The affective lexicon is one of the most important resources in affective computing for text. Manually constructed affective lexicons have limited scale and thus only limited use in practical systems. In this work, we propose a regression-based method to automatically infer the multi-dimensional affective representation of words from their word embeddings, based on a set of seed words. This method can exploit the rich semantic information captured by word embeddings to extract meanings in a specific semantic space. It is based on the assumption that different features in a word embedding contribute differently to a particular affective dimension, and that a particular feature contributes differently to different affective dimensions. Evaluation on various affective lexicons shows that our method outperforms the state-of-the-art methods on all the lexicons under different evaluation metrics by large margins. We also explore different regression models and conclude that the Ridge regression model, the Bayesian Ridge regression model, and Support Vector Regression with a linear kernel are the most suitable models. Compared to other state-of-the-art methods, our method also has a computational advantage. Experiments on a sentiment analysis task show that the lexicons extended by our method achieve better results than publicly available sentiment lexicons on eight sentiment corpora. The extended lexicons are publicly available.
1 INTRODUCTION

1. https://dumps.wikimedia.org/enwiki/latest/ Accessed May 17, 2017
2. http://www.nlpcn.org/resource/list/2 Accessed May 17, 2017
3. http://www.ltp-cloud.com/ Accessed May 17, 2017
4. Though we can directly predict discrete labels using logistic regression on word embeddings, the baseline methods can only produce scalar values. To be consistent with the baselines, we also predict the scalar value using a linear regression model.
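To make the regression formulation concrete, the following is a minimal sketch (ours, not the authors' released code) of inferring a single affective dimension such as valence with scikit-learn's Ridge model. The seed scores and embedding vectors here are small hypothetical placeholders standing in for a real lexicon and a real pre-trained embedding.

import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical seed lexicon: word -> gold valence score (1-9 scale, as in ANEW).
seed_lexicon = {"paradise": 8.7, "murder": 1.5, "table": 5.2}

# Hypothetical embedding lookup: word -> d-dimensional vector (random here).
rng = np.random.default_rng(0)
embeddings = {w: rng.random(300) for w in ["paradise", "murder", "table", "cold", "warm"]}

# Fit one linear model per affective dimension; the learned coefficient
# vector plays the role of the per-feature weights discussed in the paper.
seeds = [w for w in seed_lexicon if w in embeddings]
X = np.array([embeddings[w] for w in seeds])
y = np.array([seed_lexicon[w] for w in seeds])
model = Ridge(alpha=1.0).fit(X, y)

# Any word covered by the embedding vocabulary can now be scored.
for word in ["cold", "warm"]:
    print(word, model.predict(embeddings[word].reshape(1, -1))[0])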
TABLE 3
Result on Inferring Affective Meaning
Based on the results from these tables, we make five major observations. (1) RoWE outperforms the other methods by large margins on all the affective dimensions of all the lexicons under all the evaluation metrics. For example, on the GI lexicon, RoWE has a relative improvement of 1.2 percent on AUC and 1.3 percent on Macro-F1 over the state-of-the-art Wt-Graph method. On the ANEW lexicon, RoWE outperforms Wt-Graph with relative improvements of 36.8, 47.1, 49.2, 14.2, and 51.5 percent on the RMSE, MAE, MAPE, Kendall correlation coefficient τ, and ac1s metrics, respectively. On the touching dimension of the Perceptual lexicon, RoWE achieves a relative improvement over Wt-Graph of 26.2, 27.3, 22.8, and 24.4 percent under RMSE, MAE, τ, and ac1s, respectively. (2) Among the evaluation metrics, the rankings under RMSE, MAE, and MAPE are similar, but the rankings under the Kendall correlation coefficient τ and ac1s differ. For example, the ranking for RMSE from best to worst is RoWE, Wt-Graph, PMI, SENTPROP, Web-GP and qwn-ppv (tied), DENSIFIER. However, the ranking for τ is RoWE, Wt-Graph, qwn-ppv, SENTPROP, PMI, Web-GP, DENSIFIER, and the ranking for ac1s is RoWE, Wt-Graph, SENTPROP, Web-GP and qwn-ppv (tied), PMI, DENSIFIER. This means that different methods may have their merits under different performance measures. (3) Comparing the different dimensions of the VAD lexicons, the performance on valence is better than on arousal and dominance under τ, whereas the opposite holds under ac1s, RMSE, MAE, and MAPE. This may be because τ focuses on the ranking of the predicted values, whereas the other evaluation metrics focus on the value difference between the gold value and the predicted value. (4) For the E-ANEW lexicon, which is annotated through crowdsourcing, the mean absolute errors (MAE) of our method are 0.65, 0.58, and 0.56 on valence, arousal, and dominance, respectively, which means the predicted values are quite close to the manually annotated values. On the ac1s metric, our method achieves 93.4, 99.1, and 99.0 percent on valence, arousal, and dominance, respectively, which means that almost all the predicted values fall within one standard deviation of the manually annotated mean value. (5) The standard deviations shown in parentheses for the sentiment lexicons indicate that RoWE has smaller relative standard deviations. In other words, RoWE is more robust and less sensitive to the choice of seed words.
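For readers who want to reproduce the comparison, the metrics above can be computed as in the sketch below (our code, using numpy and scipy). We assume here that ac1s denotes the percentage of predictions falling within one standard deviation of the annotated mean, as suggested by observation (4).

import numpy as np
from scipy.stats import kendalltau

def evaluate(gold, pred, gold_std):
    """Compute RMSE, MAE, MAPE, Kendall tau, and ac1s (assumed to be the
    percentage of predictions within one std of the annotated mean)."""
    gold, pred, gold_std = map(np.asarray, (gold, pred, gold_std))
    rmse = np.sqrt(np.mean((gold - pred) ** 2))
    mae = np.mean(np.abs(gold - pred))
    mape = np.mean(np.abs((gold - pred) / gold)) * 100
    tau, _ = kendalltau(gold, pred)
    ac1s = np.mean(np.abs(gold - pred) <= gold_std) * 100
    return rmse, mae, mape, tau, ac1s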
In conclusion, our proposed RoWE method achieves the best result on all the lexicons under all the evaluation metrics. This validates our assumption that word embeddings do encode semantic information and that the regression model can effectively decode the affective meanings from the embeddings by assigning different weights to different dimensions of the embedding. Fig. 4 visualizes the weight values of ~a on the first ten dimensions of the word embedding space for the three affective dimensions of the ANEW lexicon under the VAD model. Note that the weights for the three affective dimensions can be quite different. For example, for the first embedding dimension, the corresponding affective weights are 1.11, -1.05, and 0.63, respectively.

Fig. 4. The learned weights of different affective meanings for the ANEW lexicon.

TABLE 4
Example Words Close in Embedding Space, But Not Close in Predicted Affective Space

Table 4 lists some example words in the ANEW lexicon that are close in embedding space but not close in the valence dimension. In the table, the word column is the target word, the G val column is the gold valence value, P val is the predicted valence value, and the last column lists the top 5 nearest words in embedding space based on cosine similarity, with the predicted valence value in parentheses. The words in bold are examples that are close in the embedding space but not close in the valence dimension. For example, the nearest word of cold is warm, while their predicted valence values are 4.16 and 7.09, respectively. This validates that our method can distinguish affective meanings by assigning different weights to the features in the embedding space.
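A sketch of how the neighbor lists in Table 4 can be produced follows (our illustration); the point is that cosine neighbors in embedding space may still receive very different predicted valences.

import numpy as np

def nearest_words(word, embeddings, k=5):
    # Top-k neighbors of `word` by cosine similarity, as used for Table 4.
    v = embeddings[word]
    sims = {w: float(np.dot(v, u) / (np.linalg.norm(v) * np.linalg.norm(u)))
            for w, u in embeddings.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:k]

# Usage, reusing `model` and `embeddings` from the earlier Ridge sketch
# (with a real embedding, "warm" would rank near "cold" even though their
# predicted valences differ sharply):
# for w in nearest_words("cold", embeddings):
#     print(w, model.predict(embeddings[w].reshape(1, -1))[0])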
4.2 Method Complexity

The complexity of the different methods is shown in Table 5. In this table, N is the data sample size, d is the embedding dimension, and k is the number of nearest neighbors used in Web-GP and SENTPROP; d and k are set as constants during the experiment. The second column of the table shows that the asymptotic complexities of PMI, Web-GP, Wt-Graph, and SENTPROP grow quadratically with the data size, whereas the complexities of DENSIFIER and our RoWE grow linearly with it. The third column shows the complexity with the constant coefficients d and k included. Even though d and k play no role in the Big O analysis of the second column, they do affect the efficiency of the implementations, especially when the data samples have limited size.

TABLE 5
Complexity of Different Methods

Method      Asymptotic Complexity   Complexity with Coefficients
PMI         O(N^2)                  O(N^2)
Web-GP      O(N^2)                  O(N^2 kd)
Wt-Graph    O(N^2)                  O(N^2 d)
DENSIFIER   O(N)                    O(N d^3)
SENTPROP    O(N^2)                  O(N^2 kd)
RoWE        O(N)                    O(N d^2)

To further examine run-time efficiency, we also run an experiment to observe the difference in computing time, varying the data size from 1,000 to 11,000 on the E-ANEW lexicon with the seed word number set to 300; the remaining collection is used as test data. The hardware platform is a desktop computer with an Intel(R) Xeon(R) E5-1620 CPU and 64 GB of RAM, and we close all other programs while running each method. The result is shown in Fig. 5. Web-GP is not plotted because its running time is too high, ranging from about 23,900 to 38,000 microseconds. The figure shows that RoWE requires the least running time. When the data size increases from 1,000 to 11,000, the running time of RoWE grows from 11 to 116, which translates to a roughly linear increase of 10.5 times. Although running time can be affected by the actual implementations, this experiment still reveals the computational advantage of RoWE over the other methods. In conclusion, RoWE has a complexity advantage over the other methods.

Fig. 5. The running time of different methods under different data sizes. We break the y axis at 5,000 to 6,000 to make the figure more readable. The numbers in parentheses are the running times.
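A rough timing harness in the spirit of Fig. 5 (our own sketch, not the paper's benchmark code): fix 300 seed samples, vary the number of test words, and time a single fit-plus-predict cycle of the regression model.

import time
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
d, n_seeds = 300, 300
X_seed, y_seed = rng.random((n_seeds, d)), rng.random(n_seeds)

for n_test in range(1000, 12000, 2000):
    X_test = rng.random((n_test, d))
    start = time.perf_counter()
    Ridge(alpha=1.0).fit(X_seed, y_seed).predict(X_test)
    elapsed = time.perf_counter() - start
    print(n_test, f"{elapsed * 1e3:.1f} ms")  # grows roughly linearly in n_test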
5. scikit-learn.org/ Accessed May 17, 2017

Fig. 6. The effect of seed word size on the ANEW lexicon.

The result is shown in Fig. 7. Note that as the dimension increases from 50 to 300, the performance improves steadily; between 300 and 500, however, the curve is quite flat. Generally speaking, larger dimensions do bring better performance, but they require more resources and computation power. To balance performance and computation cost, we suggest setting the dimension between 300 and 400.

Fig. 7. The effects of embedding dimension.
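A cheap proxy for the dimension sweep behind Fig. 7 (not the paper's setup, which evaluates embeddings trained at each size): score the same Ridge pipeline on truncated prefixes of synthetic 500-dimensional vectors.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X_full = rng.random((1000, 500))
y = X_full @ rng.normal(size=500) + rng.normal(scale=0.1, size=1000)

for d in [50, 100, 200, 300, 400, 500]:
    # Use only the first d coordinates as a stand-in for a d-dim embedding.
    score = cross_val_score(Ridge(alpha=1.0), X_full[:, :d], y, cv=5).mean()
    print(d, f"mean R^2 = {score:.3f}")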
In addition to the embedding trained on the Wikipedia corpus, denoted as wikiEmb, with size 204,981, as explained in Section 4.2, we also use the following publicly available embeddings, obtained with different learning methods.

1) Google embedding (GoogleEmb) [52]: trained using the SGNS model introduced in Section 3.1 on a news corpus of 10 billion tokens.6 The embedding vocabulary size is 3,000,000.
2) Glove 840B (Glove) [55]: based on weighted matrix factorization of the co-occurrence matrix built from a corpus of 840 billion tokens.7 The embedding vocabulary size is 2,196,017.
3) Meta-Embedding (MetaEmb) [65]: ensembles different embedding sources to obtain the final meta-embedding.8 The size is 2,746,092.
4) ConceptNet Vector Ensemble (CVNE) [70]: combines word2vec and Glove with structured knowledge from ConceptNet [71].9 The size is 426,572.
5) MVLSA (MVEmb) [62]: learns word embeddings from multiple sources, including a text corpus, dependency relations, morphology, a monolingual corpus, and the FrameNet knowledge base, based on generalized canonical correlation analysis.10 The size is 361,082.
6) Paragram Embedding (ParaEmb) [72]: learns word embeddings based on the paraphrase constraints from PPDB.11 The size is 1,703,756.

We test the embeddings on the common set of 1,079 words shared by all the selected embedding resources and the VADER lexicon. Among the 1,079 words, we randomly select 50 percent as seed words and use the other 50 percent as test words. We run each experiment 5 times and report the average performance, with the standard deviation in parentheses, as shown in Table 7.

TABLE 7
Evaluation of Different Embeddings on the VADER Lexicon Using RoWE

Method      RMSE       MAE        τ           ac1s
wikiEmb     1.2(.02)   .96(.01)   49.9(1.1)   53.6(1.0)
GoogleEmb   1.1(.01)   .86(.01)   55.4(1.0)   57.6(1.5)
Glove       1.0(.02)   .80(.02)   59.4(1.2)   61.7(1.5)
CVNE        .88(.01)   .69(.01)   66.0(.95)   67.3(1.2)
MetaEmb     1.1(.03)   .86(.02)   56.4(1.3)   57.8(1.4)
MVEmb       1.3(.02)   1.0(.02)   42.4(1.0)   50.7(.31)
ParaEmb     1.0(.02)   .80(.02)   59.6(1.4)   60.8(1.4)

Note that the knowledge-based CVNE achieves the best result under all the evaluation metrics, which indicates that distilling a knowledge base into an embedding can improve the quality of the word embedding. GoogleEmb performs slightly better than wikiEmb because GoogleEmb uses a much larger training corpus. Since evaluating embedding quality is not our focus, we refer the reader to [56] for a detailed discussion of the quality of embedding methods. Other than MVEmb, which seems to be low in performance, all the other embeddings perform comparably. Even though CVNE has the best performance in this experiment, this only indicates the usefulness of adding knowledge base information to an unsupervised training method. It does not by any means guarantee that CVNE is the best performer on a downstream task, because the lexicon size is limited by the coverage of the knowledge base.

Fig. 8. The performance of different regression models on the VADER lexicon.
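As a companion to Fig. 8, the sketch below shows one way to compare the regression models named in the abstract (Ridge, Bayesian Ridge, and linear-kernel SVR) under a shared cross-validation protocol; the data here are synthetic stand-ins for the actual embedding/lexicon pairs.

import numpy as np
from sklearn.linear_model import Ridge, BayesianRidge
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.random((500, 300))
y = X @ rng.normal(size=300)

models = {
    "Ridge": Ridge(alpha=1.0),
    "BayesianRidge": BayesianRidge(),
    "SVR-linear": SVR(kernel="linear"),
}
for name, model in models.items():
    # Default scoring is R^2; any of the paper's metrics could be plugged in.
    print(name, cross_val_score(model, X, y, cv=5).mean())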
Table 6 shows the example words with the top 5 largest and top 5 smallest predicted values in each affective dimension under the different affective models, using Ridge regression trained on the corresponding seed lexicons with the CVNE embedding. Note that all the learned top words are quite reasonable. As sentiment indicators, ANEW-v and EPA-e share the same top word, giving gift. Several words are listed in multiple lexicons, such as giving gift and make happy. Note also that our method is not limited to predicting the affective meaning of single words: phrase prediction is not a problem in general, as long as phrase embeddings are given (see the sketch below). Interestingly, for Concreteness, the last word, istically, is actually an adverb suffix, which is quite abstract.
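Since phrase prediction only requires phrase embeddings, here is one simple way to obtain them when none are provided: averaging the word vectors (an illustrative assumption on our part, not the paper's construction).

import numpy as np

def phrase_vector(phrase, embeddings):
    """Average word vectors as a stand-in phrase embedding."""
    vectors = [embeddings[w] for w in phrase.split() if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else None

# Usage, reusing `model` and `embeddings` from the earlier Ridge sketch:
# vec = phrase_vector("giving gift", embeddings)
# print(model.predict(vec.reshape(1, -1))[0])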
4.6 Downstream Task for Sentiment Classification

In this section, we evaluate the effectiveness of RoWE through the performance of a downstream sentiment analysis task. In this experiment, we examine the effectiveness of the lexicons obtained from RoWE compared to baseline lexicons obtained from other methods, including both manually constructed and automatically obtained ones. The sentiment corpora used in the experiment are listed in Table 8. The baseline lexicons are all publicly available and are listed in Table 9, sorted by size. Note that ANEW, VADER, and E-ANEW are obtained manually or through crowdsourcing, and the others are obtained automatically.

The setup of the experiment is to first use RoWE to extend the VADER sentiment lexicon using the different embeddings introduced in Section 4.5, as sketched below. RoWE is trained on the intersection of the VADER lexicon and the respective embedding vocabulary, so the size of each extended lexicon differs with the vocabulary of the embedding. For a fair comparison, we use the same downstream sentiment classification method for all the different lexicons. We use the VADER method for sentiment classification [39] because it is a lexicon-based method using heuristic rules; we avoid machine learning methods so that factors other than the evaluated lexicons do not affect the results. The VADER method thus better reflects the quality of the evaluated sentiment lexicons. In the sentiment analysis task, we use F-score as the evaluation metric.
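The extension step just described can be summarized in a few lines. This is a hedged sketch of the setup (train on the lexicon/vocabulary intersection, then score every remaining word), not the authors' exact pipeline.

import numpy as np
from sklearn.linear_model import Ridge

def extend_lexicon(lexicon, embeddings):
    """Extend a seed lexicon to the full embedding vocabulary: train on
    the lexicon/vocabulary intersection, then predict a score for every
    word the embedding covers but the lexicon does not."""
    seeds = [w for w in lexicon if w in embeddings]
    X = np.array([embeddings[w] for w in seeds])
    y = np.array([lexicon[w] for w in seeds])
    model = Ridge(alpha=1.0).fit(X, y)
    extended = dict(lexicon)
    for w, v in embeddings.items():
        if w not in extended:
            extended[w] = float(model.predict(v.reshape(1, -1))[0])
    return extended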
6. https://code.google.com/archive/p/word2vec/ Accessed May 17, 2017
7. http://nlp.stanford.edu/projects/glove/ Accessed May 17, 2017
8. http://cistern.cis.lmu.de/meta-emb/ Accessed May 17, 2017
9. https://github.com/commonsense/conceptnet-numberbatch Accessed May 17, 2017
10. http://cs.jhu.edu/prastog3/mvlsa/ Accessed May 17, 2017
11. http://ttic.uchicago.edu/wieting/ Accessed May 17, 2017
TABLE 6
Example Words with Top 5 Largest and Smallest Predicted Affective Values Based on CVNE Embedding

Top 5 largest predicted affective values:
VADER: giving gift, making happy, excellentness, life of party, winning baseball game
ANEW-v: giving gift, making happy, make happy, reading books, positive attitude
ANEW-a: insanity, gun, sex, rampage, tornado
ANEW-d: paradise, win, positive attitude, incredible, self
EPA-e: giving gift, heaven, make happy, making happy, positive attitude
EPA-p: god, ceo, christ, herculean strength, pope
EPA-a: raver, riot, gunfight, fighter, nightclub
DAL-e: giving gift, making happy, make happy, showing love, enjoying day
DAL-a: dangerous activity, climbing mountain, playing snooker, winning game, playing cricket
DAL-i: neighbor's house, non powered device, own home, opaque thing, single user device
Concreteness: non powered device, opaque thing, power shovel excavator, non agentive artifact, single user device

Top 5 smallest predicted affective values:
VADER: hell with, unpleasant person, hagridden, abusive language, hagride
ANEW-v: stabbing to death, life threatening condition, poor devil, crybully, abusive language
ANEW-a: soothing, librarian, dull, calm, grain
ANEW-d: uncontrollable, earthquake, lobotomy, alzheimers, dementia
EPA-e: hell, murder, rape, unpleasant person, rapist
EPA-p: coward, weakling, high and dry, slave, powerless
EPA-a: glum, cemetery, funeral, mummy, graveyard
DAL-e: mommick, unpleasant person, plague, plaguer, nidder
DAL-a: scar, shadows, elementary, supplement, oxgang
DAL-i: that degree, risibility, in such way, inhere, in this
Concreteness: more equal, confessedly, hypostatize, neuter substantive, istically
Table 10 shows the evaluation results, with the best results in bold. Note that all the lexicons obtained using RoWE are listed in the second part of the table, and the size of each obtained lexicon is included in parentheses. In general, the embedding-based lexicons perform better than the baseline lexicons. The ParaEmb lexicon, in particular, achieves the best result on all the sentiment corpora. Among the baseline lexicons, SentiWords performs the best. We want to point out that for both the baseline lexicons and the lexicons obtained from RoWE, lexicon size is not the determiner of the best performance. Among the baseline lexicons, the best performer, SentiWords, has only about 147K sentiment words, whereas NNLexicon and Tang have about 184K and 347K, respectively. The overall best performer, ParaEmb, is also not the largest in lexicon size; in fact, CVNE, which has only about 0.4M entries, performs very well. Note that MetaEmb performs much worse than the other embedding-based lexicons. Further analysis indicates that although MetaEmb is large, its overlap with the sentiment corpus vocabulary is quite low: for example, there are only 512 overlapping seeds out of 6,298 (about 8 percent) in the mpqa corpus, compared to 6,193 for ParaEmb. Also, most of the words in MetaEmb are informal strings, such as rates.download and now!download. The general conclusion is that (1) a larger overlap is generally good, but it is not the determining factor; and (2) a high-quality word embedding helps even when its size is not large (as shown by CVNE).
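The overlap analysis above is a simple vocabulary intersection. A toy sketch follows (real vocabularies would be read from the corpus and lexicon files):

# Hypothetical vocabularies standing in for, e.g., the mpqa corpus and
# an extended lexicon; the real counts (512 of 6,298) come from the files.
corpus_vocab = {"good", "bad", "movie", "plot"}
lexicon_vocab = {"good", "bad", "happy", "rates.download"}

overlap = corpus_vocab & lexicon_vocab
print(len(overlap), "of", len(corpus_vocab),
      f"({100 * len(overlap) / len(corpus_vocab):.0f} percent)")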
TABLE 9
Statistics of Baseline Sentiment Lexicons