Chinese-English Mixed Text Normalization
ABSTRACT
Along with the expansion of globalization, multilingualism has become a popular social phenomenon. More than one language may occur in the context of a single conversation. This phenomenon is also prevalent in China. A huge variety of informal Chinese texts contain English words, especially in emails, social media, and other user generated informal content. Since most existing natural language processing algorithms were designed for processing monolingual information, mixed multilingual texts cannot be well analyzed by them. Hence, it is of critical importance to preprocess the mixed texts before applying other tasks. In this paper, we first analyze the phenomenon of mixed usage of Chinese and English in Chinese microblogs. Then, we detail the proposed two-stage method for normalizing mixed texts. We propose to use a noisy channel approach to translate in-vocabulary words into Chinese. To better incorporate the historical information of users, we introduce a novel user aware neural network language model. For the out-of-vocabulary words (such as pronunciations, informal expressions, etc.), we propose to use a graph-based unsupervised method to categorize them. Experimental results on a manually annotated microblog dataset demonstrate the effectiveness of the proposed method. We also evaluate three natural language parsers with and without using the proposed method as the preprocessing step. From the results, we can see that the proposed method can significantly benefit other NLP tasks in processing mixed text.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Experimentation.

Keywords
Word Normalization, Chinese-English Mixed Text, User Aware Neural Network Language Model

1. INTRODUCTION
With the rapid growth of the Internet and the needs of globalization, exposure of individuals to multiple languages is becoming increasingly frequent. This promotes the need for people to acquire additional languages. Multilingual speakers have even outnumbered monolingual speakers [26]. Code-switching, which is the use of more than one language in the context of a single conversation, occurs frequently, especially in informal texts. Due to the drastic growth of social media, the amount of user generated content (UGC) is expanding extensively. Therefore, the mixed usage of more than one language has become a social phenomenon.

In Chinese, among all kinds of informal language phenomena, the mixed usage of Chinese and English is one of the most frequent types. Through analyzing 210 million microblogs collected from Sina Weibo (http://www.weibo.com), one of the most popular websites providing microblogging services in China, we find that over 14.8% of microblogs contain at least one English word. Moreover, these English words include not only nouns but also adjectives, adverbs, and even verbs. For example, let us consider the following example:

帮我 book 一个会议室
(Please help me book a meeting room)

The speaker uses "book" instead of its Chinese translation "预订" to express his meaning.

Since existing natural language processing techniques (e.g. POS tagging, chunking, parsing, opinion mining, etc.) were designed for processing monolingual text, multilingual mixed texts cannot be well processed by these methods. Moreover, due to the lack of annotated corpora for informal texts, the effectiveness of most state-of-the-art supervised models is highly impacted when processing informal content. We evaluated the performances of the Stanford Parser and the Berkeley Parser, both of which are widely used for various applications; their accuracies drop to 66.4% and 67.9% respectively in tagging the POS of English words in Chinese-English mixed microblogs. This also demonstrates the great importance of normalizing mixed texts. Zhao et al. [36] have also noticed this issue and proposed to use dynamic features under a sequence labeling framework to address the POS tagging problem for mixed texts. However, their work only focused on a specific task, POS tagging, and cannot be directly adopted for other tasks. Moreover, creating training data for mixed texts or investigating novel algorithms specifically for different NLP tasks are all time-consuming and sometimes difficult to accomplish. We argue that this kind of task-specific approach cannot be easily generalized.
Several existing works have approached the task from the perspective of normalizing general informal Chinese texts. Li and Yarowsky [16] proposed to use a bootstrapping model to identify candidate informal phrases and a conditional log-linear model, based on rule-based intuitions and data co-occurrence phenomena, to rank the candidates. Wang and Ng [33] introduced a beam-search decoder based normalization method for missing word recovery, punctuation correction, word correction with a manually assembled dictionary, and resegmentation. However, these methods are mainly based on the assumption of frequent occurrence. According to our statistics based on microblogs collected from a real online service, the occurrence of English words in mixed texts also follows Zipf's law. Therefore, the frequencies of many English words in the mixed texts are low. These infrequent English words cannot be well translated or categorized by these methods.

Although most Chinese-English mixed microblogs only contain a few English words, this is still a very challenging task due to the following facts: 1) there is an enormous number and a wide variety of English words; according to our statistics, more than 149K distinct words are included in 2.6 million mixed microblogs. 2) the linguistic and syntactic usages of English words may differ from their original ones; because these mixed texts usually follow the grammar of one language, the part-of-speech tags of some English words may even be changed.

In this paper, we take a normalization centric view of processing Chinese-English mixed texts. We propose a novel two-stage method to achieve the task. For in-vocabulary English words, we propose to translate them into Chinese. Out-of-vocabulary words (including pronunciations, informal expressions, etc.) are classified into different categories, such as person name, organization name, and so on. With these steps, mixed texts can be processed by existing NLP methods with little additional effort. The normalized texts are also much easier for monolingual speakers to understand. To the best of our knowledge, this is the first work focused on normalizing Chinese-English mixed texts. We propose to use a noisy-channel approach with a neural network language model to translate in-vocabulary words. To capture the historical information of users, we propose a novel user aware neural network language model. For training the word-level translation model, we constructed a parallel corpus based on subtitles of movies and TV series. To categorize words, we propose to use a graph-based unsupervised method with a novel initialization technique. For evaluating the proposed method, we also manually constructed a labeled corpus. Experimental results show that the proposed method achieves better performances than state-of-the-art methods. The main contributions of this paper are as follows:

• We analyze the phenomenon of Chinese-English mixed usage based on a large collection of microblogs;

• We propose a two-stage normalization method, which translates in-vocabulary words with a noisy-channel model incorporating a novel user aware neural network language model, and categorizes the remaining tokens with a graph-based unsupervised method using a novel initialization technique;

• We construct a parallel corpus from movie and TV subtitles and a manually labeled evaluation corpus, and demonstrate the effectiveness of the proposed method both directly and as a preprocessing step for three parsers.

2. RELATED WORK
The research on text normalization can be traced back to the task of converting numbers, dates, etc. into standard dictionary words. Along with the rapid increase of user generated content, the text normalization task has received much more attention in recent years [1, 13, 16, 2, 18, 14, 10, 6]. In this paper we classify the related works into three categories: lexical normalization, named entity normalization and informal text processing.

2.1 Lexical Normalization
Aw et al. [1] treated the lexical normalisation problem as a translation problem from the informal language to formal English. They also studied the differences among SMS normalization, general text normalization, spelling check and text paraphrasing. Based on the investigated phenomena of SMS messages, they adapted a phrase-based method to achieve the task.

Kobus et al. [13] studied the problem of normalizing the orthography of French SMS messages. They proposed a machine translation based method and a nondeterministic phonemic transduction based method.

Han and Baldwin [8] proposed a supervised method to detect ill-formed words and used morphophonemic similarity to generate correction candidates. Then, all candidates were ranked based on a number of features generated from noisy context and the similarity between ill-formed words and candidates.

Liu et al. [18] proposed a broad coverage lexical normalization method consisting of three components. They assumed that a set of letter transformation patterns were used by humans to decipher the nonstandard tokens and integrated three human perspectives, including enhanced letter transformation, visual priming, and string/phonetic similarity.

Han et al. [9] introduced a dictionary based method and an automatic normalisation dictionary construction method. They assumed that lexical variants and their standard forms occur in similar contexts.

Derczynski et al. [6] also proposed to use a dictionary based method to achieve the task. They created a set of mappings from OOV words to their IV equivalents, using slang dictionaries and manual examination of the training data.

Wang and Ng [33] focused on the problem of missing word recovery, punctuation correction, word correction with a manually assembled dictionary, and resegmentation. They introduced a beam-search decoder based normalization method to do it.

Although these methods achieved significant improvements on processing SMS and UGC, they only focused on monolingual text. Hence, Chinese-English mixed text cannot be directly processed by these methods.
[Figure 1: Distribution of number of English tokens per microblog. (Bar chart; x-axis: # of English tokens, 1-10; y-axis: percentage of mixed microblogs, 0%-100%.)]

[Figure 2: English token frequency in Chinese-English mixed microblogs. (Log-log plot; x-axis: rank of a token in the frequency table; y-axis: total number of the token's occurrences, 1 to 1e+07.)]
2.2 Named Entity Normalization
Xie et al. [34] proposed to extract Chinese abbreviations and their corresponding definitions based on anchor texts. They constructed a weighted bipartite graph from anchor texts and applied co-frequency based measures to quantify the relatedness between two anchor texts.

Li and Yarowsky [17] proposed an unsupervised method for extracting the relation between a full-form phrase and its abbreviation from monolingual corpora. They used the data co-occurrence intuition to identify relations between abbreviations and full names. They also improved a statistical machine translation system by incorporating the extracted relations into the baseline translation system.

Based on the data co-occurrence phenomena, Li and Yarowsky [16] also introduced a bootstrapping procedure to identify formal-informal relations in web corpora. They used a search engine to extract contextual instances of the given informal phrase, and ranked the candidate relation pairs using a conditional log-linear model.

2.3 Informal Text Processing
Besides the normalization based methods, a number of works have been proposed to directly process informal texts [7, 19, 20, 30, 27, 36].

Freitag [7] studied the problem of performing information extraction from informal text. They showed strategies for creating a term-space representation and exploiting typographic information in the form of token features. Minkov et al. [19] also introduced methods for extracting named entities from informal texts and showed that informal text has different characteristics from formal text.

Mullen and Malouf [20] described statistical tests on a dataset of political discussion group postings. They concluded that traditional text classification methods would be inadequate for the task of sentiment analysis in this domain.

Thelwall et al. [30] focused on the task of detecting sentiment strength in informal texts. Through experimental results, they demonstrated that incorporating the decoding of nonstandard spellings was one of the factors behind the relative improvements given by their method.

Ritter et al. [27] experimentally demonstrated that existing methods for POS tagging, chunking and named entity recognition perform quite poorly on tweets. They presented a distantly supervised approach based on LabeledLDA for the named entity classification problem on tweets.

Zhao et al. [36] proposed to use dynamic features under a sequence labeling framework to address the POS tagging problem for Chinese-English mixed texts. They extracted features from both local and non-local information and took advantage of the characteristics of the mixed texts.

Previous works have studied the problem from various aspects. However, most works focused on specific tasks. Different from them, in this paper, we take a normalization centric view of processing Chinese-English mixed texts.

3. DATA ANALYSIS
For a better understanding of the phenomenon of mixed usage of Chinese and English, in this section, we examine the dataset, which contains about 210 million microblogs crawled from Sina Weibo. We first describe the analysis results on the raw dataset. Then, we introduce the results acquired from manually categorized English words in these mixed texts. Since the Chinese phonetic system and some informal usages are also represented by English alphabet letters, for clarity, we use "English token" to denote a sequence of English alphabet letters without any blank in between.

First, the microblogs which contain at least one English token are extracted from the dataset. We observe that more than 14.8% of the microblogs are Chinese-English mixed texts. Figure 1 shows the distribution of the number of English tokens per microblog in the Chinese-English mixed microblogs. We can observe that more than 94.6% of the microblogs contain fewer than 3 English tokens. About 78.8% of the mixed texts contain only one English token. This means that most English tokens are surrounded by Chinese characters. Hence, the linguistic usages of these English tokens may be different from their original ones. The part-of-speech tags of some tokens are even changed.

From the perspective of the English tokens, we also look at the frequency of each token. Figure 2 shows a plot of English token frequency. The plot is in log-log coordinates. The x-axis is the rank of a token in the frequency table. The y-axis is the total number of the token's occurrences. From the figure, we can observe that the frequency of English tokens also follows Zipf's law. This means that many tokens occur infrequently.
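To make the analysis procedure concrete, the following Python sketch shows how English tokens can be extracted and counted to produce the statistics behind Figures 1 and 2. It is a minimal illustration of ours, not the original analysis code; the regular expression and all function names are assumptions.

import re
from collections import Counter

# An "English token", per the definition above: a sequence of
# English alphabet letters with no blank in between.
ENGLISH_TOKEN = re.compile(r"[A-Za-z]+")

def english_tokens(microblog):
    return ENGLISH_TOKEN.findall(microblog)

def mixed_text_stats(microblogs):
    """Compute the share of mixed microblogs (Figure 1's population)
    and the rank-frequency table of English tokens (Figure 2)."""
    mixed = 0
    counts = Counter()
    for text in microblogs:
        tokens = english_tokens(text)
        if tokens:
            mixed += 1
            counts.update(t.lower() for t in tokens)
    ratio = mixed / len(microblogs) if microblogs else 0.0
    # Sorting by frequency gives the rank-frequency curve; a roughly
    # straight line in log-log space indicates Zipf's law.
    return ratio, counts.most_common()

if __name__ == "__main__":
    corpus = ["帮我 book 一个会议室", "更新几条weibo", "今天天气不错"]
    ratio, rank_freq = mixed_text_stats(corpus)
    print(f"mixed ratio: {ratio:.1%}")   # 2 of the 3 toy microblogs are mixed
    print(rank_freq[:5])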
Table 1: Categories of English tokens in Chinese-English mixed texts. (English translations are in the brackets.)

Category           Percent   Example
Vocabulary word*   68.3%     别忘了明天的 meeting (Please don't forget tomorrow's meeting).
Abbreviation       12.3%     BBC制作的《美丽中国》(Wild China produced by BBC).
Pronunciation      4.0%      更新几条 weibo (Update several microblogs).
Slang              7.8%      Orz (A posture emoticon representing a kneeling person).
Other              7.6%      User ID, Chinese word followed by "ing", misspelling, and so on.

* The vocabulary is constructed based on the parallel corpus, which will be described in Section 4.1.
To investigate the types of these English tokens, we randomly selected 2,000 microblogs which contain at least one English token and manually labeled the categories of the English tokens. The five categories we use to classify the tokens are listed in Table 1, which also shows examples and the percentage of each category. From the table, we can observe that vocabulary words and abbreviations account for 68.3% and 12.3% of all English tokens respectively. This means that the tokens which can be translated make up a great part of all mixed texts. Among the five categories, tokens belonging to the "Slang" and "Other" categories are the most difficult to normalize. For example, "Orz" originated from Japan and can be used to express various meanings in different contexts.

4. THE PROPOSED METHOD
In this section, we describe the proposed normalization method for English tokens in Chinese-English mixed texts. For the tokens which are vocabulary words, we propose to use a noisy channel model with word embeddings to translate them. For the out-of-vocabulary words and other types of English tokens, a graph-based unsupervised method is introduced to categorize them. The following sections describe the proposed methods.

4.1 Word Translation
Let t represent the given mixed text. It contains a sequence of words w1 w2 ... wn. Each word wi is either a Chinese word or an English word. For the mixed text t, the word translation method tries to produce the normalization candidate ĉ under the noisy channel model, which contains two components:

• A language model assigns a probability p(c) to any sentence c = w1 w2 ... wn in Chinese.

• A translation model assigns a conditional probability p(c|t) to any Chinese/mixed-text pair of sentences.

Given these two components, following the general noisy-channel approach, the output of the translation model on a Chinese-English mixed sentence t is:

ĉ = arg max_c p(c) × p(c|t). (1)
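As an illustration of Eq. (1), the following Python sketch enumerates candidate normalizations and scores them with the two components. It is a simplified sketch of ours, not the paper's implementation: `translation_table` and `language_model` are assumed to be supplied (they correspond, respectively, to the translation probabilities and the language model introduced below), and exhaustive enumeration is only practical because mixed texts contain few English tokens.

import math

def noisy_channel_decode(mixed_words, translation_table, language_model):
    # Eq. (1): choose the candidate sentence c maximizing p(c) * p(c|t).
    # translation_table[w] -> list of (chinese_word, prob) pairs;
    # language_model(words) -> log p(words). Both are assumed inputs.
    candidates = [([], 0.0)]              # (partial sentence, log p(c|t))
    for w in mixed_words:
        if w in translation_table:        # English word: branch on translations
            candidates = [(sent + [c], lp + math.log(p))
                          for sent, lp in candidates
                          for c, p in translation_table[w]]
        else:                             # Chinese word: passes through unchanged
            candidates = [(sent + [w], lp) for sent, lp in candidates]
    return max(candidates, key=lambda x: language_model(x[0]) + x[1])[0]

# Toy usage with a uniform "language model":
table = {"book": [("预订", 0.7), ("书", 0.3)]}
print(noisy_channel_decode(["帮我", "book", "会议室"], table, lambda ws: 0.0))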
For the language model, we propose a user aware neural network language model, which assigns a score to each (word sequence, user historical information) pair. The scoring components are computed by two neural networks.

Following the framework proposed by Collobert and Weston [4], given a word sequence c and user historical information u, our goal is to discriminate the correct last word in c from other random words. s(c, u) represents the scoring function modeled by the neural networks. c^w represents the word sequence c with the last word replaced by word w. Hence, the objective is that the margin between s(c, u) and s(c^w, u) is larger than 1 for any other word w in the vocabulary. The objective function is to minimize the ranking loss for each (c, u) in the training corpus:

L_{c,u} = Σ_{w∈V} max(0, 1 − s(c, u) + s(c^w, u)). (2)

First, the word sequence c = w1 w2 ... wn is represented by an ordered list of vectors x = (x1, x2, ..., xn), where xi is the embedding of word i in the sequence. xi is a column of the embedding matrix E ∈ R^{m×|V|}, in which |V| denotes the vocabulary size. The embedding matrix E is learned and updated during the training procedure. score_l is modeled by a neural network with one hidden layer:

a1 = f(W1 [x1; x2; ...; xn] + b1), (3)
score_l = W2 a1 + b2, (4)

where f is an element-wise activation function such as tanh; a1 ∈ R^{h×1} is the activation of the hidden layer with h hidden nodes; W1 ∈ R^{h×(mn)} is the first layer weight matrix of the neural network; W2 ∈ R^{1×h} is the second layer weight matrix; b1, b2 are the biases of each layer.

Following the work of Huang et al. [11], we represent the user historical information as the weighted average of the embeddings of all words belonging to the user history. The user historical information vector u is calculated as follows:

u = (Σ_{i=1}^{m} f(w_i^u) x_i^u) / (Σ_{i=1}^{m} f(w_i^u)), (5)

where w_1^u, w_2^u, ..., w_m^u are the words of the user historical information; x_i^u denotes the embedding of w_i^u; and f(·) captures the importance of each word (e.g., its idf value), following Huang et al. [11].
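The following numpy sketch illustrates Eqs. (2)-(5): the one-hidden-layer local scorer, the weighted average user vector, and the hinge ranking loss for one sampled corrupt word. All dimensions and the idf-style weight table are illustrative assumptions of ours, not the paper's actual configuration.

import numpy as np

rng = np.random.default_rng(0)
m, n, h, V = 50, 4, 100, 10000   # embedding size, window length, hidden units, |V|

E  = rng.normal(scale=0.01, size=(m, V))      # embedding matrix, learned in training
W1 = rng.normal(scale=0.01, size=(h, m * n)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.01, size=(1, h));     b2 = np.zeros(1)

def score_local(word_ids):
    # Eqs. (3)-(4): concatenate the n word embeddings, apply one tanh
    # hidden layer, then a linear output layer.
    x = np.concatenate([E[:, i] for i in word_ids])
    a1 = np.tanh(W1 @ x + b1)
    return float(W2 @ a1 + b2)

def user_vector(history_ids, weight):
    # Eq. (5): importance-weighted average of the user-history embeddings;
    # `weight` (e.g. an idf table, as in Huang et al. [11]) is an assumption.
    w = np.array([weight[i] for i in history_ids])
    return (E[:, history_ids] * w).sum(axis=1) / w.sum()

def hinge_ranking_loss(window_ids, corrupt_id, scorer):
    # Eq. (2) for one sampled corrupt word: require a margin of 1 between
    # the true last word and the corrupt replacement.
    return max(0.0, 1.0 - scorer(window_ids)
                         + scorer(window_ids[:-1] + [corrupt_id]))

print(hinge_ranking_loss([3, 17, 42, 7], 99, score_local))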
[Figure 3: The architecture of the user aware neural network language model. score_l is computed from the local word window w_{i−3} w_{i−2} w_{i−1} w_i; score_u is computed from the user historical information vector u, a weighted average of the embeddings of the words in the user's history; the final score is their sum.]
Similarly, score_u is modeled by another neural network with one hidden layer, which takes the concatenation of the last word embedding xn and the user historical information vector u as input:

a1^u = f(W1^u [xn; u] + b1^u), (6)
score_u = W2^u a1^u + b2^u, (7)

where a1^u ∈ R^{h^u×1} is the activation of the hidden layer with h^u hidden nodes; W1^u ∈ R^{h^u×(2m)} is the first layer weight matrix of the neural network; W2^u ∈ R^{1×h^u} is the second layer weight matrix; b1^u, b2^u are the biases of each layer.

The final score is the sum of the scores of the local context and the user historical information:

score = score_l + score_u. (8)

To train the parameters, i.e., the weights of the neural networks and the embedding matrix E, we follow the corrupt example sampling method [4], and sample the gradient of the objective by randomly choosing a word from the dictionary as a corrupt example for each sequence-context pair (c, u). These weights are updated via back-propagation.

For the translation model p(c|t), we make the following independence assumption:

p(c|t) = Π_{ti∈Eng} p(ci|ti), (9)

where p(ci|ti) is the probability of generating the Chinese word ci from the English word ti and can be estimated by IBM Model 1 on a parallel corpus.

Based on Eq. (8) and Eq. (9), the translation model Eq. (1) on a new Chinese-English mixed sentence t can be reformulated as:

ĉ = arg max_{c∈C} p(c) × p(c|t)
  ∝ arg max_{c∈C} Π_{i: ti∈Eng} score(t1 t2 ... ti, u) p(ci|ti). (10)

According to the statistics of the crawled corpus, more than 94.6% of Chinese-English mixed texts contain fewer than 3 English words. Hence, in this work, the decoding problem can be solved efficiently.

4.2 Word Categorization
As described in Section 3, besides in-vocabulary words, there are also a number of English tokens used as product names, informal expressions, and so on. For these tokens, we propose to use label propagation (LP) [37] to classify them into different categories. In this work, we try to classify English tokens into the following five categories: person name, product name, organization name, slang, and loanword.

From analyzing the dataset, we observe that (I) words belonging to the same category tend to have similar contexts; (II) words and their corresponding category description words tend to have frequent co-occurrence relations. Previous research also shows that words with high context similarity tend to have similar semantic meanings [9]. Based on observation (I), we propose to use context similarities as the edge weights of the graph. LP transfers labels from labeled data to unlabeled data through the weighted graph. Based on observation (II), we propose a novel label initialization method.

4.2.1 Graph Construction
We construct an undirected graph G = (V, E) to represent the relations between English tokens. V = {v1, ..., vn} denotes the vertices of the graph, which represent the English tokens to be categorized. E = {eij, 1 ≤ i, j ≤ n} represents the similarities between tokens; eij is the similarity between vertices vi and vj. In order to reduce the computation cost of iterative propagation, we prune the graph by excluding edges whose weight is less than a threshold θ.

To calculate the similarities between tokens, we first extract all the context words in a predefined window (the window size used in this work is 4) for each token. Context words are treated independently of each other. We use the vector space model to construct a context vector f, each dimension of which is weighted by tf·idf. The cosine measure is used to calculate the similarity between tokens:

eij = cos(fi, fj) = fi · fj / (‖fi‖ ‖fj‖). (11)
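A minimal sketch of this graph construction step (context extraction, tf·idf weighting, cosine similarity and θ-pruning of Eq. (11)) could look as follows; the exact treatment of document frequency here is a simplifying assumption of ours.

import math
from collections import defaultdict

def context_vectors(sentences, targets, window=4):
    # Build a tf.idf-weighted context vector for every target token,
    # treating each context word independently (Section 4.2.1).
    tf = {t: defaultdict(int) for t in targets}
    df = defaultdict(int)          # number of context windows a word occurs in
    n_windows = 0
    for words in sentences:
        for i, w in enumerate(words):
            if w in tf:
                ctx = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
                n_windows += 1
                for c in set(ctx):
                    df[c] += 1
                for c in ctx:
                    tf[w][c] += 1
    return {t: {c: freq * math.log(n_windows / df[c]) for c, freq in vec.items()}
            for t, vec in tf.items()}

def cosine(u, v):
    # Eq. (11), on sparse dict vectors.
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_graph(vectors, theta=0.1):
    # Keep only edges whose similarity reaches the pruning threshold theta.
    tokens = list(vectors)
    return {(a, b): s
            for i, a in enumerate(tokens) for b in tokens[i + 1:]
            if (s := cosine(vectors[a], vectors[b])) >= theta}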
4.2.2 Label Propagation
The label propagation method transfers labels from labeled data to unlabeled data over the constructed weighted graph. It has been successfully used for many tasks [21, 15, 32, 5, 29]. In this work, we also adopt LP to obtain the categories of English tokens.

From observation (II), we know that words tend to co-occur frequently with their category description words. For example, more than 1.98 billion documents can be retrieved through Google using the query "iphone product", while only 25 million documents are returned for the query "iphone informal expressions". Based on this observation, we first construct a number of description words for each category. cdw_ij represents the jth description word of the ith category. Similar to SO-PMI-IR [31], we propose to use the following equation to measure the possibility of token vi belonging to category zj:

SO-P(vi, zj) = max_{k=1..m} hits(vi NEAR cdw_jk) / (hits(vi) · hits(cdw_jk)), (12)

where hits(query) is the number of hits returned by Bing for the given query, and "NEAR" is used to restrict the distance between the search phrases. (We use the advanced Bing keyword "near:10" to implement the NEAR constraint.)

The label distribution of each vertex is initialized as follows:

q_i^0(zj) = 1, if vi ∈ Vs and vi ∈ Vs_j;
q_i^0(zj) = 0, if vi ∈ Vs and vi ∉ Vs_j;
q_i^0(zj) = SO-P(vi, zj) / Σ_{k=1}^{m} SO-P(vi, zk), otherwise; (13)

where q_i^k (i = 1...|V|) is the category distribution of vertex vi after k propagation steps; q_i^k(zj) is the weight of category zj in q_i^k; Vs_j is the set of seed words for category zj; and Vs = Vs_1 ∪ ... ∪ Vs_m is the set of seed words of all categories.

With these initialization weights, the label propagation method iteratively updates q_i^k through the weighted edges. In each iteration, the propagation follows the condition that edges with higher similarities allow easier propagation. The category distribution of each vertex is updated as follows:

q_i^k(zj) = q_i^0(zj), if vi ∈ Vs;
q_i^k(zj) = Σ_{vj∈N(vi)} eij · q_j^{k−1}(zj) / Σ_{vj∈N(vi)} eij, otherwise; (14)

where N(vi) is the set of vertices linked to vi.
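The initialization of Eq. (13) and the update of Eq. (14) can be sketched as follows. The `so_p` argument is assumed to wrap the search-engine based score of Eq. (12), and `edges` is assumed to contain both orientations of every edge; both names are our own.

def initialize(tokens, seeds_by_cat, so_p):
    # Eq. (13): seed words get a one-hot distribution; the rest are
    # initialized from normalized SO-P scores.
    cats = list(seeds_by_cat)
    q0 = {}
    for v in tokens:
        if any(v in seeds for seeds in seeds_by_cat.values()):
            q0[v] = [1.0 if v in seeds_by_cat[c] else 0.0 for c in cats]
        else:
            scores = [so_p(v, c) for c in cats]
            z = sum(scores) or 1.0
            q0[v] = [s / z for s in scores]
    return q0

def propagate(q0, seeds, edges, neighbors, iterations=10):
    # Eq. (14): seed vertices keep q^0; every other vertex takes the
    # similarity-weighted average of its neighbors' previous distributions.
    q = dict(q0)
    for _ in range(iterations):
        nxt = {}
        for v, dist in q.items():
            if v in seeds:
                nxt[v] = q0[v]
                continue
            total = sum(edges[(v, u)] for u in neighbors[v])
            if total == 0.0:
                nxt[v] = dist          # isolated vertex keeps its distribution
            else:
                nxt[v] = [sum(edges[(v, u)] * q[u][j] for u in neighbors[v]) / total
                          for j in range(len(dist))]
        q = nxt
    return q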
5. EXPERIMENTS
In this section, we describe the experimental evaluation of the proposed method. First, we describe the collections used for evaluation and the experimental setup. Second, the performances of word translation and categorization are presented. Finally, we evaluate the performances of three Chinese parsers with and without the proposed method as a preprocessing step.

5.1 Collection
As described in Section 3, for analyzing the phenomenon of Chinese-English mixed text, we collected 210 million microblogs from Sina Weibo. For evaluating the effectiveness of the proposed methods, we used a subset of them as testing data. We randomly selected 1,000 microblogs from all Chinese-English mixed ones and manually labeled the translations or categories of all English tokens in these texts. The testing data contains 1,195 English tokens in total.

Three annotators were involved in the labeling task. Since most mixed texts contain only a small number of English tokens, the ambiguity problem is not serious. The annotators were first asked to provide translations for all tokens. To evaluate the quality of the corpus, we validated the agreement of the human annotations using Cohen's kappa coefficient. The average κ among all annotators is 0.646, which indicates that the annotations of the corpus are reliable. Since some words have multiple translations, one of the annotators made the final decision about which translations should be included in the gold standard. For word categorization, the annotators were also asked to label a category for every word. If a category was labeled by more than two annotators for a word, that category was selected as the standard label of the word.

5.2 Experiment Configurations
For training the word translation model described in Section 4.1, we collected 24,853 subtitles of movies and TV series from Shooter (http://www.shooter.cn, one of the most popular websites providing subtitles with Chinese translations). All these subtitles contain both Chinese and English text, in single or separate files. Using these subtitles, we constructed a parallel corpus which contains more than 18.5 million sentence pairs. For training the neural network language model, we randomly sampled 10 million microblogs, due to computational limits. We implemented the proposed method based on the code of Huang et al. [11] (http://ai.stanford.edu/~ehhuang/).

FudanNLP [25] is used for Chinese word segmentation. For training the word translation model, we use the Giza++ toolkit [22] with the parallel corpus we constructed. The similarity graph construction is implemented on Hadoop 1.2.0 to handle the massive computation. We also incorporate the proposed method as the preprocessing step for three parsers: Stanford Parser 3.2 [28], Berkeley Parser 1.7 [12], and FudanNLP 1.57 [25].

For evaluating word translation, we adopt word-level n-best accuracy. For each English token, the output is considered correct if any of the corresponding gold standard words is among the top-n returned results. The evaluation metrics used for word categorization throughout the experiments are Precision, Recall, and F1-score.
5.3 Word Translation Results
For the translation model, we compare the proposed parallel corpus based method with a dictionary based method. (We use dict.cn, one of the biggest online dictionary websites in China, as the dictionary.)

• "D" represents the dictionary based method, where all the translations given by the dictionary are selected as candidates.

• "PC" represents the parallel corpus based method, where the translation probability is given by the Giza++ toolkit.

Table 2: Word translation results of different methods.

Methods                Accuracy (%)
                       Top-1   Top-5   Top-10
D-LM†                  28.9    47.6    51.9
D-NLM                  29.5    48.9    52.0
D-NLM+U                31.2    50.9    52.6
PC-LM†                 60.3    80.7    84.0
PC-NLM                 61.4    83.8    88.1
PC-NLM+U               64.6    86.2    91.5
Li and Yarowsky [16]   21.2    33.6    37.5
Han et al. [9]         19.6    27.2    31.3

† The in-vocabulary words based on the online dictionary and the parallel corpus account for 66.3% and 67.6% respectively among all English tokens.
[Figures 4 and 5: Top-1 accuracy of the proposed method with respect to the size of the training data and the number of iterations (1-10); both y-axes range from 40% to 70%.]
Table 3: Word categorization results of different methods. P, R and F denote precision, recall and F1-score (%); Acc. is the overall accuracy (%).

Methods       Person Name      Product Name     Org. Name        Slang            Loanwords        Acc.
              P    R    F      P    R    F      P    R    F      P    R    F      P    R    F
LP            37.5 23.8 29.1   93.9 18.4 30.8   57.7 17.4 26.8   19.2 31.3 23.8   8.8  39.1 14.4   22.5
INIT          23.1 60.0 33.3   36.2 27.6 31.3   28.1 52.3 36.6   27.9 45.8 34.6   42.9 13.0 20.0   32.3
INIT+LP/WOS   80.3 10.0 17.2   84.9 36.8 51.4   34.4 89.5 49.7   40.3 56.3 47.0   50.0 4.4  8.0    42.9
INIT+LP/WS    90.9 19.2 31.7   86.4 58.6 69.8   39.0 84.9 53.4   48.7 75.0 59.0   57.1 17.4 26.7   55.8
[Figure 6: The impact of the number of seed words per category. (x-axis: # seed words per category, 0-10; y-axis: accuracy, 40%-60%.)]

Table 4: The performances of POS tagging of English tokens by different parsers with/without the proposed method. "WP" and "WoP" represent the accuracy with and without the normalization method respectively.

Methods               WoP     WP
Stanford Parser 3.2   66.4%   84.7%
Berkeley Parser 1.7   67.9%   83.9%
FudanNLP 1.57         54.0%   79.6%

5.4 Word Categorization Results
For the methods INIT+LP/WS and LP, 4 seed words are used for each category. The total number of out-of-vocabulary words is 387, which accounts for about 32.3% of all English tokens.

From the results, we can observe that the proposed method (INIT+LP/WS) achieves the best performance in four of the five categories. The accuracy of the proposed method is also significantly better than that of the other methods. Comparing the results with and without the proposed initialization method (INIT+LP/WS vs. LP), we can observe that the initialization contributes a lot: the relative accuracy improvement is more than 148%. This demonstrates that the proposed method can achieve better performance with a small number of seed words.

From Table 3, we also observe that none of the methods achieves satisfactory performance for the person name and loanword categories. We analyzed the errors in these categories and found that many of them are caused by acronyms. Person names may be represented by acronyms; however, some of these acronyms are also used as organization names. Since the number of webpages about an organization is usually larger than the number of webpages describing a person, most of them are classified into the organization category. This is also one of the main reasons why all the methods achieve low precision in the organization name category. These acronyms cannot be correctly classified without context information.

To show the impact of the number of seed words on performance, we evaluate the accuracy of the word categorization method with different numbers of seed words. The results are shown in Figure 6. From the figure, we can observe that the accuracy is significantly improved with seed words compared to the method without seed words. The accuracy improves continuously as the number of seed words increases. However, when the number of seed words per category is more than 4, the performance increases slowly. We think that the lack of context information is one of the main reasons.

5.5 Applications
To show the effectiveness of the proposed method as a preprocessing step for other NLP tasks, we evaluate our method with three Chinese parsers. In this work, we only focus on the relative changes caused by English tokens. We randomly select 100 microblogs from the whole evaluation set. The in-vocabulary words are translated into Chinese. The words which cannot be translated are replaced by their category names.
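This replacement scheme can be sketched as follows; the helper names and the fallback for tokens with no predicted category are our own assumptions, not the paper's implementation.

def normalize_for_parsing(words, translation, category):
    # Replace each English token either by its top-1 Chinese translation
    # (in-vocabulary) or by the name of its predicted category (OOV),
    # so that a monolingual Chinese parser can process the sentence.
    normalized = []
    for w in words:
        if not (w.isascii() and w.isalpha()):
            normalized.append(w)                    # Chinese tokens pass through
        elif w in translation:
            normalized.append(translation[w])       # e.g. "book" -> "预订"
        else:
            normalized.append(category.get(w, w))   # e.g. "iPhone" -> "产品名" (hypothetical tag)
    return normalized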
Table 4 shows the accuracy of POS tagging of English tokens with and without the proposed normalization method. From the results, we can observe that all three parsers benefit a lot from the proposed normalization method in processing mixed texts. The relative improvements are significant. Since features extracted from words play important roles in existing methods, the lack of word information may highly impact their performances. For methods which label all English tokens with the same tag [24], the proposed method can bring even more benefit.

6. CONCLUSIONS
In this paper, we focus on the task of normalizing Chinese-English mixed texts. We first analyzed the phenomenon of mixed usage in Chinese. Then, we proposed to use word translation and categorization to achieve the task. For word translation, we use a noisy-channel approach with a neural network language model to translate in-vocabulary words. A novel user aware neural network language model is introduced to capture the useful historical information of users. For categorizing words, a graph-based unsupervised method is proposed. We also introduce a novel initialization technique to improve its effectiveness. Experimental results show that the top-5 word translation accuracy of the proposed method achieves 86.2%. For word categorization, the proposed method also achieves significant improvement over the baseline methods. The relative improvement over the original label
propagation method is more than 148%. We also incorporate the proposed method as a preprocessing step for three different parsers. All of them benefit a lot from the proposed method.

7. ACKNOWLEDGEMENT
The authors wish to thank the anonymous reviewers for their helpful comments. This work was partially funded by the National Natural Science Foundation of China (61003092, 61073069), the National Major Science and Technology Special Project of China (2014ZX03006005), the Shanghai Municipal Science and Technology Commission (No. 12511504502), Key Projects in the National Science & Technology Pillar Program (2012BAH18B01), and the "Chen Guang" project supported by the Shanghai Municipal Education Commission and the Shanghai Education Development Foundation (11CG05).

8. REFERENCES
[1] A. Aw, M. Zhang, J. Xiao, and J. Su. A phrase-based statistical model for SMS text normalization. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 33-40, Sydney, Australia, July 2006. Association for Computational Linguistics.
[2] R. Beaufort, S. Roekhaut, L.-A. Cougnon, and C. Fairon. A hybrid rule/model-based finite-state framework for normalizing SMS messages. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 770-779, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[3] J.-S. Chang and W.-L. Teng. Mining atomic Chinese abbreviations with a probabilistic single character recovery model. Language Resources and Evaluation, 40(3-4):367-374, 2006.
[4] R. Collobert and J. Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 160-167, New York, NY, USA, 2008. ACM.
[5] D. Das and S. Petrov. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 600-609, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
[6] L. Derczynski, A. Ritter, S. Clark, and K. Bontcheva. Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Proceedings of the International Conference on Recent Advances in Natural Language Processing. Association for Computational Linguistics, 2013.
[7] D. Freitag. Machine learning for information extraction in informal domains. Machine Learning, 39(2-3):169-202, 2000.
[8] B. Han and T. Baldwin. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 368-378, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
[9] B. Han, P. Cook, and T. Baldwin. Automatically constructing a normalisation dictionary for microblogs. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 421-432, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[10] B. Han, P. Cook, and T. Baldwin. Lexical normalization for social media text. ACM Transactions on Intelligent Systems and Technology, 4(1):5:1-5:27, Feb. 2013.
[11] E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 873-882, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[12] M. Johnson and A. E. Ural. Reranking the Berkeley and Brown parsers. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 665-668, Los Angeles, California, June 2010. Association for Computational Linguistics.
[13] C. Kobus, F. Yvon, and G. Damnati. Normalizing SMS: are two metaphors better than one? In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING '08, pages 441-448, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.
[14] C. Li and Y. Liu. Improving text normalization using character-blocks based models and system combination. In Proceedings of COLING 2012, pages 1587-1602, Mumbai, India, December 2012. The COLING 2012 Organizing Committee.
[15] X. Li, Y.-Y. Wang, and A. Acero. Learning query intent from regularized click graphs. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08, pages 339-346, New York, NY, USA, 2008. ACM.
[16] Z. Li and D. Yarowsky. Mining and modeling relations between formal and informal Chinese phrases from web corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 1031-1040, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.
[17] Z. Li and D. Yarowsky. Unsupervised translation induction for Chinese abbreviations using monolingual corpora. In Proceedings of ACL-08: HLT, pages 425-433, Columbus, Ohio, June 2008. Association for Computational Linguistics.
[18] F. Liu, F. Weng, and X. Jiang. A broad-coverage normalization system for social media language. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 1035-1044, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[19] E. Minkov, R. C. Wang, and W. W. Cohen. Extracting personal names from email: applying named entity recognition to informal text. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 443-450, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.
[20] T. Mullen and R. Malouf. A preliminary investigation into sentiment analysis of informal political discourse. In Proceedings of the AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs, 2006.
[21] Z.-Y. Niu, D.-H. Ji, and C. L. Tan. Word sense disambiguation using label propagation based semi-supervised learning. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 395-402, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.
[22] F. J. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51, Mar. 2003.
[23] N. Okazaki, M. Ishizuka, and J. Tsujii. A discriminative approach to Japanese abbreviation extraction. In Proceedings of the Third International Joint Conference on Natural Language Processing (IJCNLP 2008), pages 889-894, 2008.
[24] X. Qian, Q. Zhang, X. Huang, and L. Wu. 2D trie for fast parsing. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 904-912, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[25] X. Qiu, Q. Zhang, and X. Huang. FudanNLP: A toolkit for Chinese natural language processing. In Proceedings of ACL, 2013.
[26] G. Richard. A global perspective on bilingualism and bilingual education. In Georgetown University Round Table on Languages and Linguistics 1999: Language in Our Time: Bilingual Education and Official English, Ebonics and Standard English, Immigration and the Unz Initiative, page 332, 2001.
[27] A. Ritter, S. Clark, Mausam, and O. Etzioni. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1524-1534, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[28] R. Socher, J. Bauer, C. D. Manning, and A. Y. Ng. Parsing with compositional vector grammars. In Proceedings of ACL 2013, June 2013.
[29] A. Tamura, T. Watanabe, and E. Sumita. Bilingual lexicon extraction from comparable corpora using label propagation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 24-36, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[30] M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, and A. Kappas. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12):2544-2558, 2010.
[31] P. D. Turney and M. L. Littman. Unsupervised learning of semantic orientation from a hundred-billion-word corpus. Technical Report ERB-1094 (NRC #44929), National Research Council of Canada, 2002.
[32] L. Velikovich, S. Blair-Goldensohn, K. Hannan, and R. McDonald. The viability of web-derived polarity lexicons. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 777-785, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[33] P. Wang and H. T. Ng. A beam-search decoder for normalization of social media text with application to machine translation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 471-481, Atlanta, Georgia, June 2013. Association for Computational Linguistics.
[34] L.-X. Xie, Y.-B. Zheng, Z.-Y. Liu, M.-S. Sun, and C.-H. Wang. Extracting Chinese abbreviation-definition pairs from anchor texts. In Machine Learning and Cybernetics (ICMLC), volume 4, pages 1485-1491, 2011.
[35] D. Yang, Y.-C. Pan, and S. Furui. Vocabulary expansion through automatic abbreviation generation for Chinese voice search. Computer Speech & Language, 26(5):321-335, 2012.
[36] J. Zhao, X. Qiu, S. Zhang, F. Ji, and X. Huang. Part-of-speech tagging for Chinese-English mixed texts with dynamic features. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 1379-1388, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[37] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.