Chinese-English Mixed Text Normalization
ABSTRACT
Along with the expansion of globalization, multilingualism has become a popular social phenomenon. More than one language may occur in the context of a single conversation. This phenomenon is also prevalent in China. A huge variety of informal Chinese texts contain English words, especially in emails, social media, and other user generated informal content. Since most existing natural language processing algorithms were designed for processing monolingual information, mixed multilingual texts cannot be well analyzed by them. Hence, it is of critical importance to preprocess the mixed texts before applying other tasks. In this paper, we first analyze the phenomenon of mixed usage of Chinese and English in Chinese microblogs. Then, we detail the proposed two-stage method for normalizing mixed texts. We propose to use a noisy channel approach to translate in-vocabulary words into Chinese. To better incorporate the historical information of users, we introduce a novel user aware neural network language model. For the out-of-vocabulary words (such as pronunciations, informal expressions, etc.), we propose to use a graph-based unsupervised method to categorize them. Experimental results on a manually annotated microblog dataset demonstrate the effectiveness of the proposed method. We also evaluate three natural language parsers with and without using the proposed method as the preprocessing step. From the results, we can see that the proposed method can significantly benefit other NLP tasks in processing mixed text.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Experimentation.

Keywords
Word Normalization, Chinese-English Mixed Text, User Aware Neural Network Language Model

1. INTRODUCTION
With the rapid growth of the Internet and the needs of globalization, exposure of individuals to multiple languages is becoming increasingly frequent. This promotes the need for people to acquire additional languages. Multilingual speakers have even outnumbered monolingual speakers [26]. Code-switching, which is the use of more than one language in the context of a single conversation, occurs frequently, especially in informal texts. Due to the drastic growth of social media, the amount of user generated content (UGC) is expanding extensively. Therefore, the mixed usage of more than one language has become a social phenomenon.

In Chinese, among all kinds of informal language phenomena, the mixed usage of Chinese and English is one of the most frequent types. Through analyzing 210 million microblogs collected from Sina Weibo (http://www.weibo.com), one of the most popular websites providing microblogging services in China, we find that over 14.8% of microblogs contain at least one English word. Moreover, these English words include not only nouns but also adjectives, adverbs, and even verbs. For example, let us consider the following example:

帮我 book 一个会议室
(Please help me book a meeting room)

The speaker uses "book" instead of its Chinese translation "预订" to express his meaning.

Since existing natural language processing techniques (e.g. POS tagging, chunking, parsing, opinion mining, etc.) were designed for processing monolingual text, multilingual mixed texts cannot be well processed by these methods. Moreover, due to the lack of annotated corpora for informal texts, the effectiveness of most state-of-the-art supervised models is highly impacted when processing informal content. We evaluated the performances of the Stanford Parser and the Berkeley Parser, both of which are widely used for various applications; their accuracies drop to 66.4% and 67.9% respectively in tagging the POS of English words in Chinese-English mixed microblogs. This also demonstrates the great importance of normalizing mixed texts. Zhao et al. [36] have also noticed this issue and proposed to use dynamic features under a sequence labeling framework to address the POS tagging problem for mixed texts. However, their work only focused on a specific task, POS tagging, and cannot be directly adopted for other tasks. Moreover, creating training data for mixed texts or investigating novel algorithms specifically for different NLP tasks are all time-consuming and sometimes difficult to accomplish. We argue that this kind of task-specific approach cannot be easily generalized.
Several existing works have approached the task from the perspective of normalizing general informal Chinese texts. Li and Yarowsky [16] proposed to use a bootstrapping model to identify candidate informal phrases and a conditional log-linear model, based on rule-based intuitions and data co-occurrence phenomena, to rank the candidates. Wang and Ng [33] introduced a beam-search decoder based normalization method for missing word recovery, punctuation correction, word correction with a manually assembled dictionary, and resegmentation. However, these methods are mainly based on the assumption of frequent occurrence. According to our statistics based on microblogs collected from a real online service, the occurrence of English words in mixed texts also follows Zipf's law. Therefore, the frequencies of many English words in the mixed texts are low. These infrequent English words cannot be well translated or categorized by these methods.

Although most Chinese-English mixed microblogs only contain a few English words, this is still a very challenging task due to the following facts: 1) there is an enormous number and a wide variety of English words; according to our statistics, more than 149K distinct words are included in 2.6 million mixed microblogs. 2) the linguistic and syntactic usages of English words may differ from their original ones; because these mixed texts usually follow the grammar of one language, the part-of-speech tags of some English words may even be changed.

In this paper, we take a normalization centric view of processing Chinese-English mixed texts. We propose a novel two-stage method to achieve the task. For in-vocabulary English words, we propose to translate them into Chinese. Out-of-vocabulary words (including pronunciations, informal expressions, etc.) are classified into different categories, such as person name, organization name, and so on. With these steps, mixed texts can be processed by existing NLP methods with little additional effort. The normalized texts are also much easier for monolingual speakers to understand. To the best of our knowledge, this is the first work focused on normalizing Chinese-English mixed texts. We propose to use a noisy-channel approach with a neural network language model to translate in-vocabulary words. To capture the historical information of users, we propose a novel user aware neural network language model. For training the word-level translation model, we constructed a parallel corpus based on subtitles of movies and TV series. To categorize words, we propose to use a graph-based unsupervised method with a novel initialization technique. For evaluating the proposed method, we also manually constructed a labeled corpus. Experimental results show that the proposed method achieves better performances than state-of-the-art methods. The main contributions of this paper are as follows:

• We analyze the phenomenon of Chinese-English mixed usage based on a large collection of microblogs;

• We propose a two-stage normalization method, which translates in-vocabulary words with a noisy-channel model incorporating a novel user aware neural network language model, and categorizes the remaining tokens with a graph-based unsupervised method using a novel initialization technique;

• We construct a parallel corpus from movie and TV subtitles and a manually labeled evaluation corpus, and demonstrate the effectiveness of the proposed method both directly and as a preprocessing step for three parsers.

2. RELATED WORK
The research on text normalization can be traced back to the task of converting numbers, dates, etc. into standard dictionary words. Along with the rapid increase of user generated content, the text normalization task has received much more attention in recent years [1, 13, 16, 2, 18, 14, 10, 6]. In this paper we classify the related works into three categories: lexical normalization, named entity normalization and informal text processing.

2.1 Lexical Normalization
Aw et al. [1] treated the lexical normalisation problem as a translation problem from the informal language to formal English. They also studied the differences among SMS normalization, general text normalization, spelling check and text paraphrasing. Based on the investigated phenomena of SMS messages, they adapted a phrase-based method to achieve the task.

Kobus et al. [13] studied the problem of normalizing the orthography of French SMS messages. They proposed a machine translation based method and a nondeterministic phonemic transduction based method.

Han and Baldwin [8] proposed a supervised method to detect ill-formed words and used morphophonemic similarity to generate correction candidates. Then, all candidates were ranked based on a number of features generated from noisy context and the similarity between ill-formed words and candidates.

Liu et al. [18] proposed a broad coverage lexical normalization method consisting of three components. They assumed that a set of letter transformation patterns were used by humans to decipher the nonstandard tokens and integrated three human perspectives, including enhanced letter transformation, visual priming, and string/phonetic similarity.

Han et al. [9] introduced a dictionary based method and an automatic normalisation dictionary construction method. They assumed that lexical variants and their standard forms occur in similar contexts.

Derczynski et al. [6] also proposed to use a dictionary based method to achieve the task. They created a set of mappings from OOV words to their IV equivalents, using slang dictionaries and manual examination of the training data.

Wang and Ng [33] focused on the problem of missing word recovery, punctuation correction, word correction with a manually assembled dictionary, and resegmentation. They introduced a beam-search decoder based normalization method to do it.

Although these methods achieved significant improvements on processing SMS and UGC, they only focused on monolingual text. Hence, Chinese-English mixed text cannot be directly processed by these methods.
[Figure 1: Distribution of number of English tokens per microblog. (Bar chart; x-axis: # of English tokens, 1-10; y-axis: percentage of mixed microblogs, 0%-100%.)]

[Figure 2: English token frequency in Chinese-English mixed microblogs. (Log-log plot; x-axis: rank of a token in the frequency table; y-axis: total number of the token's occurrences, 1 to 1e+07.)]
2.2 Named Entity Normalization
Xie et al. [34] proposed to extract Chinese abbreviations and their corresponding definitions based on anchor texts. They constructed a weighted bipartite graph from anchor texts and applied co-frequency based measures to quantify the relatedness between two anchor texts.

Li and Yarowsky [17] proposed an unsupervised method for extracting the relation between a full-form phrase and its abbreviation from monolingual corpora. They used the data co-occurrence intuition to identify relations between abbreviations and full names. They also improved a statistical machine translation system by incorporating the extracted relations into the baseline translation system.

Based on the data co-occurrence phenomena, Li and Yarowsky [16] also introduced a bootstrapping procedure to identify formal-informal relations in web corpora. They used a search engine to extract contextual instances of the given informal phrase, and ranked the candidate relation pairs using a conditional log-linear model.

2.3 Informal Text Processing
Besides the normalization based methods, a number of works have been proposed to directly process informal texts [7, 19, 20, 30, 27, 36].

Freitag [7] studied the problem of performing information extraction from informal text. They showed strategies for creating a term-space representation and exploiting typographic information in the form of token features. Minkov et al. [19] also introduced methods for extracting named entities from informal texts and showed that informal text has different characteristics from formal text.

Mullen and Malouf [20] described statistical tests on a dataset of political discussion group postings. They concluded that traditional text classification methods would be inadequate for the task of sentiment analysis in this domain.

Thelwall et al. [30] focused on the task of detecting sentiment strength in informal texts. Through experimental results, they demonstrated that incorporating the decoding of nonstandard spellings was one of the factors behind the relative improvements given by their method.

Ritter et al. [27] experimentally demonstrated that existing methods for POS tagging, chunking and named entity recognition perform quite poorly on tweets. They presented a distantly supervised approach based on LabeledLDA for the named entity classification problem on tweets.

Zhao et al. [36] proposed to use dynamic features under a sequence labeling framework to address the POS tagging problem for Chinese-English mixed texts. They extracted features from both local and non-local information and took advantage of the characteristics of the mixed texts.

Previous works have studied the problem from various aspects. However, most works focused on specific tasks. Different from them, in this paper, we take a normalization centric view of processing Chinese-English mixed texts.

3. DATA ANALYSIS
For a better understanding of the phenomenon of mixed usage of Chinese and English, in this section, we examine the dataset, which contains about 210 million microblogs crawled from Sina Weibo. We first describe the analysis results on the raw dataset. Then, we introduce the results acquired from manually categorized English words in these mixed texts. Since the Chinese phonetic system and some informal usages are also represented by English alphabet letters, for clarity, we use "English token" to denote a sequence of English alphabet letters without any blank in between.

First, the microblogs which contain at least one English token are extracted from the dataset. We observe that more than 14.8% of the microblogs are Chinese-English mixed texts. Figure 1 shows the distribution of the number of English tokens per microblog in the Chinese-English mixed microblogs. We can observe that more than 94.6% of the microblogs contain fewer than 3 English tokens. About 78.8% of the mixed texts contain only one English token. This means that most English tokens are surrounded by Chinese characters. Hence, the linguistic usages of these English tokens may be different from their original ones. The part-of-speech tags of some tokens are even changed.

From the perspective of the English tokens, we also look at the frequency of each token. Figure 2 shows a plot of English token frequency. The plot is in log-log coordinates. The x-axis is the rank of a token in the frequency table. The y-axis is the total number of the token's occurrences. From the figure, we can observe that the frequency of English tokens also follows Zipf's law. This means that many tokens occur infrequently.
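To make the analysis procedure concrete, the following Python sketch shows how English tokens can be extracted and counted to produce the statistics behind Figures 1 and 2. It is a minimal illustration of ours, not the original analysis code; the regular expression and all function names are assumptions.

import re
from collections import Counter

# An "English token", per the definition above: a sequence of
# English alphabet letters with no blank in between.
ENGLISH_TOKEN = re.compile(r"[A-Za-z]+")

def english_tokens(microblog):
    return ENGLISH_TOKEN.findall(microblog)

def mixed_text_stats(microblogs):
    """Compute the share of mixed microblogs (Figure 1's population)
    and the rank-frequency table of English tokens (Figure 2)."""
    mixed = 0
    counts = Counter()
    for text in microblogs:
        tokens = english_tokens(text)
        if tokens:
            mixed += 1
            counts.update(t.lower() for t in tokens)
    ratio = mixed / len(microblogs) if microblogs else 0.0
    # Sorting by frequency gives the rank-frequency curve; a roughly
    # straight line in log-log space indicates Zipf's law.
    return ratio, counts.most_common()

if __name__ == "__main__":
    corpus = ["帮我 book 一个会议室", "更新几条weibo", "今天天气不错"]
    ratio, rank_freq = mixed_text_stats(corpus)
    print(f"mixed ratio: {ratio:.1%}")   # 2 of the 3 toy microblogs are mixed
    print(rank_freq[:5])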
Table 1: Categories of English tokens in Chinese-English mixed texts. (English translations are in the brackets.)

Category           Percent   Example
Vocabulary word*   68.3%     别忘了明天的 meeting (Please don't forget tomorrow's meeting).
Abbreviation       12.3%     BBC制作的《美丽中国》(Wild China produced by BBC).
Pronunciation      4.0%      更新几条 weibo (Update several microblogs).
Slang              7.8%      Orz (A posture emoticon representing a kneeling person).
Other              7.6%      User ID, Chinese word followed by "ing", misspelling, and so on.

* The vocabulary is constructed based on the parallel corpus, which will be described in Section 4.1.
To investigate the types of these English tokens, we randomly selected 2,000 microblogs which contain at least one English token and manually labeled the categories of the English tokens. The five categories we use to classify the tokens are listed in Table 1, which also shows examples and the percentage of each category. From the table, we can observe that vocabulary words and abbreviations account for 68.3% and 12.3% of all English tokens respectively. This means that the tokens which can be translated make up a great part of all mixed texts. Among the five categories, tokens belonging to the "Slang" and "Other" categories are the most difficult to normalize. For example, "Orz" originated from Japan and can be used to express various meanings in different contexts.

4. THE PROPOSED METHOD
In this section, we describe the proposed normalization method for English tokens in Chinese-English mixed texts. For the tokens which are vocabulary words, we propose to use a noisy channel model with word embeddings to translate them. For the out-of-vocabulary words and other types of English tokens, a graph-based unsupervised method is introduced to categorize them. The following sections describe the proposed methods.

4.1 Word Translation
Let t represent the given mixed text. It contains a sequence of words w1 w2 ... wn. Each word wi is either a Chinese word or an English word. For the mixed text t, the word translation method tries to produce the normalization candidate ĉ under the noisy channel model, which contains two components:

• A language model assigns a probability p(c) to any sentence c = w1 w2 ... wn in Chinese.

• A translation model assigns a conditional probability p(c|t) to any Chinese/mixed-text pair of sentences.

Given these two components, following the general noisy-channel approach, the output of the translation model on a Chinese-English mixed sentence t is:

ĉ = arg max_c p(c) × p(c|t). (1)
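As an illustration of Eq. (1), the following Python sketch enumerates candidate normalizations and scores them with the two components. It is a simplified sketch of ours, not the paper's implementation: `translation_table` and `language_model` are assumed to be supplied (they correspond, respectively, to the translation probabilities and the language model introduced below), and exhaustive enumeration is only practical because mixed texts contain few English tokens.

import math

def noisy_channel_decode(mixed_words, translation_table, language_model):
    # Eq. (1): choose the candidate sentence c maximizing p(c) * p(c|t).
    # translation_table[w] -> list of (chinese_word, prob) pairs;
    # language_model(words) -> log p(words). Both are assumed inputs.
    candidates = [([], 0.0)]              # (partial sentence, log p(c|t))
    for w in mixed_words:
        if w in translation_table:        # English word: branch on translations
            candidates = [(sent + [c], lp + math.log(p))
                          for sent, lp in candidates
                          for c, p in translation_table[w]]
        else:                             # Chinese word: passes through unchanged
            candidates = [(sent + [w], lp) for sent, lp in candidates]
    return max(candidates, key=lambda x: language_model(x[0]) + x[1])[0]

# Toy usage with a uniform "language model":
table = {"book": [("预订", 0.7), ("书", 0.3)]}
print(noisy_channel_decode(["帮我", "book", "会议室"], table, lambda ws: 0.0))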
For the language model, we propose a user aware neural network language model, which assigns a score to each (word sequence, user historical information) pair. The scoring components are computed by two neural networks.

Following the framework proposed by Collobert and Weston [4], given a word sequence c and user historical information u, our goal is to discriminate the correct last word in c from other random words. s(c, u) represents the scoring function modeled by the neural networks. c^w represents the word sequence c with the last word replaced by word w. Hence, the objective is that the margin between s(c, u) and s(c^w, u) is larger than 1 for any other word w in the vocabulary. The objective function is to minimize the ranking loss for each (c, u) in the training corpus:

L_{c,u} = Σ_{w∈V} max(0, 1 − s(c, u) + s(c^w, u)). (2)

First, the word sequence c = w1 w2 ... wn is represented by an ordered list of vectors x = (x1, x2, ..., xn), where xi is the embedding of word i in the sequence. xi is a column of the embedding matrix E ∈ R^{m×|V|}, in which |V| denotes the vocabulary size. The embedding matrix E is learned and updated during the training procedure. score_l is modeled by a neural network with one hidden layer:

a1 = f(W1 [x1; x2; ...; xn] + b1), (3)
score_l = W2 a1 + b2, (4)

where f is an element-wise activation function such as tanh; a1 ∈ R^{h×1} is the activation of the hidden layer with h hidden nodes; W1 ∈ R^{h×(mn)} is the first layer weight matrix of the neural network; W2 ∈ R^{1×h} is the second layer weight matrix; b1, b2 are the biases of each layer.

Following the work of Huang et al. [11], we represent the user historical information as the weighted average of the embeddings of all words belonging to the user history. The user historical information vector u is calculated as follows:

u = (Σ_{i=1}^{m} f(w_i^u) x_i^u) / (Σ_{i=1}^{m} f(w_i^u)), (5)

where w_1^u, w_2^u, ..., w_m^u are the words of the user historical information; x_i^u denotes the embedding of w_i^u; and f(·) captures the importance of each word (e.g., its idf value), following Huang et al. [11].
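The following numpy sketch illustrates Eqs. (2)-(5): the one-hidden-layer local scorer, the weighted average user vector, and the hinge ranking loss for one sampled corrupt word. All dimensions and the idf-style weight table are illustrative assumptions of ours, not the paper's actual configuration.

import numpy as np

rng = np.random.default_rng(0)
m, n, h, V = 50, 4, 100, 10000   # embedding size, window length, hidden units, |V|

E  = rng.normal(scale=0.01, size=(m, V))      # embedding matrix, learned in training
W1 = rng.normal(scale=0.01, size=(h, m * n)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.01, size=(1, h));     b2 = np.zeros(1)

def score_local(word_ids):
    # Eqs. (3)-(4): concatenate the n word embeddings, apply one tanh
    # hidden layer, then a linear output layer.
    x = np.concatenate([E[:, i] for i in word_ids])
    a1 = np.tanh(W1 @ x + b1)
    return float(W2 @ a1 + b2)

def user_vector(history_ids, weight):
    # Eq. (5): importance-weighted average of the user-history embeddings;
    # `weight` (e.g. an idf table, as in Huang et al. [11]) is an assumption.
    w = np.array([weight[i] for i in history_ids])
    return (E[:, history_ids] * w).sum(axis=1) / w.sum()

def hinge_ranking_loss(window_ids, corrupt_id, scorer):
    # Eq. (2) for one sampled corrupt word: require a margin of 1 between
    # the true last word and the corrupt replacement.
    return max(0.0, 1.0 - scorer(window_ids)
                         + scorer(window_ids[:-1] + [corrupt_id]))

print(hinge_ranking_loss([3, 17, 42, 7], 99, score_local))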
[Figure 3: The architecture of the user aware neural network language model. score_l is computed from the local word window w_{i−3} w_{i−2} w_{i−1} w_i; score_u is computed from the user historical information vector u, a weighted average of the embeddings of the words in the user's history; the final score is their sum.]
Similarly, score_u is modeled by another neural network with one hidden layer, which takes the concatenation of the last word embedding xn and the user historical information vector u as input:

a1^u = f(W1^u [xn; u] + b1^u), (6)
score_u = W2^u a1^u + b2^u, (7)

where a1^u ∈ R^{h^u×1} is the activation of the hidden layer with h^u hidden nodes; W1^u ∈ R^{h^u×(2m)} is the first layer weight matrix of the neural network; W2^u ∈ R^{1×h^u} is the second layer weight matrix; b1^u, b2^u are the biases of each layer.

The final score is the sum of the scores of the local context and the user historical information:

score = score_l + score_u. (8)

To train the parameters, i.e., the weights of the neural networks and the embedding matrix E, we follow the corrupt example sampling method [4], and sample the gradient of the objective by randomly choosing a word from the dictionary as a corrupt example for each sequence-context pair (c, u). These weights are updated via back-propagation.

For the translation model p(c|t), we make the following independence assumption:

p(c|t) = Π_{ti∈Eng} p(ci|ti), (9)

where p(ci|ti) is the probability of generating the Chinese word ci from the English word ti and can be estimated by IBM Model 1 on a parallel corpus.

Based on Eq. (8) and Eq. (9), the translation model Eq. (1) on a new Chinese-English mixed sentence t can be reformulated as:

ĉ = arg max_{c∈C} p(c) × p(c|t)
  ∝ arg max_{c∈C} Π_{i: ti∈Eng} score(t1 t2 ... ti, u) p(ci|ti). (10)

According to the statistics of the crawled corpus, more than 94.6% of Chinese-English mixed texts contain fewer than 3 English words. Hence, in this work, the decoding problem can be solved efficiently.

4.2 Word Categorization
As described in Section 3, besides in-vocabulary words, there are also a number of English tokens used as product names, informal expressions, and so on. For these tokens, we propose to use label propagation (LP) [37] to classify them into different categories. In this work, we try to classify English tokens into the following five categories: person name, product name, organization name, slang, and loanword.

From analyzing the dataset, we observe that (I) words belonging to the same category tend to have similar contexts; (II) words and their corresponding category description words tend to have frequent co-occurrence relations. Previous research also shows that words with high context similarity tend to have similar semantic meanings [9]. Based on observation (I), we propose to use context similarities as the edge weights of the graph. LP transfers labels from labeled data to unlabeled data through the weighted graph. Based on observation (II), we propose a novel label initialization method.

4.2.1 Graph Construction
We construct an undirected graph G = (V, E) to represent the relations between English tokens. V = {v1, ..., vn} denotes the vertices of the graph, which represent the English tokens to be categorized. E = {eij, 1 ≤ i, j ≤ n} represents the similarities between tokens; eij is the similarity between vertices vi and vj. In order to reduce the computation cost of iterative propagation, we prune the graph by excluding edges whose weight is less than a threshold θ.

To calculate the similarities between tokens, we first extract all the context words in a predefined window (the window size used in this work is 4) for each token. Context words are treated independently of each other. We use the vector space model to construct a context vector f, each dimension of which is weighted by tf·idf. The cosine measure is used to calculate the similarity between tokens:

eij = cos(fi, fj) = fi · fj / (‖fi‖ ‖fj‖). (11)
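A minimal sketch of this graph construction step (context extraction, tf·idf weighting, cosine similarity and θ-pruning of Eq. (11)) could look as follows; the exact treatment of document frequency here is a simplifying assumption of ours.

import math
from collections import defaultdict

def context_vectors(sentences, targets, window=4):
    # Build a tf.idf-weighted context vector for every target token,
    # treating each context word independently (Section 4.2.1).
    tf = {t: defaultdict(int) for t in targets}
    df = defaultdict(int)          # number of context windows a word occurs in
    n_windows = 0
    for words in sentences:
        for i, w in enumerate(words):
            if w in tf:
                ctx = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
                n_windows += 1
                for c in set(ctx):
                    df[c] += 1
                for c in ctx:
                    tf[w][c] += 1
    return {t: {c: freq * math.log(n_windows / df[c]) for c, freq in vec.items()}
            for t, vec in tf.items()}

def cosine(u, v):
    # Eq. (11), on sparse dict vectors.
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def build_graph(vectors, theta=0.1):
    # Keep only edges whose similarity reaches the pruning threshold theta.
    tokens = list(vectors)
    return {(a, b): s
            for i, a in enumerate(tokens) for b in tokens[i + 1:]
            if (s := cosine(vectors[a], vectors[b])) >= theta}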
4.2.2 Label Propagation
The label propagation method transfers labels from labeled data to unlabeled data over the constructed weighted graph. It has been successfully used for many tasks [21, 15, 32, 5, 29]. In this work, we also adopt LP to obtain the categories of English tokens.

From observation (II), we know that words tend to co-occur frequently with their category description words. For example, more than 1.98 billion documents can be retrieved through Google using the query "iphone product", while only 25 million documents are returned for the query "iphone informal expressions". Based on this observation, we first construct a number of description words for each category. cdw_ij represents the jth description word of the ith category. Similar to SO-PMI-IR [31], we propose to use the following equation to measure the possibility of token vi belonging to category zj:

SO-P(vi, zj) = max_{k=1..m} hits(vi NEAR cdw_jk) / (hits(vi) · hits(cdw_jk)), (12)

where hits(query) is the number of hits returned by Bing for the given query, and "NEAR" is used to restrict the distance between the search phrases. (We use the advanced Bing keyword "near:10" to implement the NEAR constraint.)

The label distribution of each vertex is initialized as follows:

q_i^0(zj) = 1, if vi ∈ Vs and vi ∈ Vs_j;
q_i^0(zj) = 0, if vi ∈ Vs and vi ∉ Vs_j;
q_i^0(zj) = SO-P(vi, zj) / Σ_{k=1}^{m} SO-P(vi, zk), otherwise; (13)

where q_i^k (i = 1...|V|) is the category distribution of vertex vi after k propagation steps; q_i^k(zj) is the weight of category zj in q_i^k; Vs_j is the set of seed words for category zj; and Vs = Vs_1 ∪ ... ∪ Vs_m is the set of seed words of all categories.

With these initialization weights, the label propagation method iteratively updates q_i^k through the weighted edges. In each iteration, the propagation follows the condition that edges with higher similarities allow easier propagation. The category distribution of each vertex is updated as follows:

q_i^k(zj) = q_i^0(zj), if vi ∈ Vs;
q_i^k(zj) = Σ_{vj∈N(vi)} eij · q_j^{k−1}(zj) / Σ_{vj∈N(vi)} eij, otherwise; (14)

where N(vi) is the set of vertices linked to vi.
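The initialization of Eq. (13) and the update of Eq. (14) can be sketched as follows. The `so_p` argument is assumed to wrap the search-engine based score of Eq. (12), and `edges` is assumed to contain both orientations of every edge; both names are our own.

def initialize(tokens, seeds_by_cat, so_p):
    # Eq. (13): seed words get a one-hot distribution; the rest are
    # initialized from normalized SO-P scores.
    cats = list(seeds_by_cat)
    q0 = {}
    for v in tokens:
        if any(v in seeds for seeds in seeds_by_cat.values()):
            q0[v] = [1.0 if v in seeds_by_cat[c] else 0.0 for c in cats]
        else:
            scores = [so_p(v, c) for c in cats]
            z = sum(scores) or 1.0
            q0[v] = [s / z for s in scores]
    return q0

def propagate(q0, seeds, edges, neighbors, iterations=10):
    # Eq. (14): seed vertices keep q^0; every other vertex takes the
    # similarity-weighted average of its neighbors' previous distributions.
    q = dict(q0)
    for _ in range(iterations):
        nxt = {}
        for v, dist in q.items():
            if v in seeds:
                nxt[v] = q0[v]
                continue
            total = sum(edges[(v, u)] for u in neighbors[v])
            if total == 0.0:
                nxt[v] = dist          # isolated vertex keeps its distribution
            else:
                nxt[v] = [sum(edges[(v, u)] * q[u][j] for u in neighbors[v]) / total
                          for j in range(len(dist))]
        q = nxt
    return q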
5. EXPERIMENTS
In this section, we describe the experimental evaluation of the proposed method. First, we describe the collections used for evaluation and the experimental setup. Second, the performances of word translation and categorization are presented. Finally, we evaluate the performances of three Chinese parsers with and without the proposed method as a preprocessing step.

5.1 Collection
As described in Section 3, for analyzing the phenomenon of Chinese-English mixed text, we collected 210 million microblogs from Sina Weibo. For evaluating the effectiveness of the proposed methods, we used a subset of them as testing data. We randomly selected 1,000 microblogs from all Chinese-English mixed ones and manually labeled the translations or categories of all English tokens in these texts. The testing data contains 1,195 English tokens in total.

Three annotators were involved in the labeling task. Since most mixed texts contain only a small number of English tokens, the ambiguity problem is not serious. The annotators were first asked to provide translations for all tokens. To evaluate the quality of the corpus, we validated the agreement of the human annotations using Cohen's kappa coefficient. The average κ among all annotators is 0.646, which indicates that the annotations of the corpus are reliable. Since some words have multiple translations, one of the annotators made the final decision about which translations should be included in the gold standard. For word categorization, the annotators were also asked to label a category for every word. If a category was labeled by more than two annotators for a word, that category was selected as the standard label of the word.

5.2 Experiment Configurations
For training the word translation model described in Section 4.1, we collected 24,853 subtitles of movies and TV series from Shooter (http://www.shooter.cn, one of the most popular websites providing subtitles with Chinese translations). All these subtitles contain both Chinese and English text, in single or separate files. Using these subtitles, we constructed a parallel corpus which contains more than 18.5 million sentence pairs. For training the neural network language model, we randomly sampled 10 million microblogs, due to computational limits. We implemented the proposed method based on the code of Huang et al. [11] (http://ai.stanford.edu/~ehhuang/).

FudanNLP [25] is used for Chinese word segmentation. For training the word translation model, we use the Giza++ toolkit [22] with the parallel corpus we constructed. The similarity graph construction is implemented on Hadoop 1.2.0 to handle the massive computation. We also incorporate the proposed method as the preprocessing step for three parsers: Stanford Parser 3.2 [28], Berkeley Parser 1.7 [12], and FudanNLP 1.57 [25].

For evaluating word translation, we adopt word-level n-best accuracy. For each English token, the output is considered correct if any of the corresponding gold standard words is among the top-n returned results. The evaluation metrics used for word categorization throughout the experiments are Precision, Recall, and F1-score.
5.3 Word Translation Results
For the translation model, we compare the proposed parallel corpus based method with a dictionary based method. (We use dict.cn, one of the biggest online dictionary websites in China, as the dictionary.)

• "D" represents the dictionary based method, where all the translations given by the dictionary are selected as candidates.

• "PC" represents the parallel corpus based method, where the translation probability is given by the Giza++ toolkit.

Table 2: Word translation results of different methods.

Methods                Accuracy (%)
                       Top-1   Top-5   Top-10
D-LM†                  28.9    47.6    51.9
D-NLM                  29.5    48.9    52.0
D-NLM+U                31.2    50.9    52.6
PC-LM†                 60.3    80.7    84.0
PC-NLM                 61.4    83.8    88.1
PC-NLM+U               64.6    86.2    91.5
Li and Yarowsky [16]   21.2    33.6    37.5
Han et al. [9]         19.6    27.2    31.3

† The in-vocabulary words based on the online dictionary and the parallel corpus account for 66.3% and 67.6% respectively among all English tokens.
[Figures 4 and 5: Top-1 accuracy of the proposed method with respect to the size of the training data and the number of iterations (1-10); both y-axes range from 40% to 70%.]
Table 3: Word categorization results of different methods. P, R and F denote precision, recall and F1-score (%); Acc. is the overall accuracy (%).

Methods       Person Name      Product Name     Org. Name        Slang            Loanwords        Acc.
              P    R    F      P    R    F      P    R    F      P    R    F      P    R    F
LP            37.5 23.8 29.1   93.9 18.4 30.8   57.7 17.4 26.8   19.2 31.3 23.8   8.8  39.1 14.4   22.5
INIT          23.1 60.0 33.3   36.2 27.6 31.3   28.1 52.3 36.6   27.9 45.8 34.6   42.9 13.0 20.0   32.3
INIT+LP/WOS   80.3 10.0 17.2   84.9 36.8 51.4   34.4 89.5 49.7   40.3 56.3 47.0   50.0 4.4  8.0    42.9
INIT+LP/WS    90.9 19.2 31.7   86.4 58.6 69.8   39.0 84.9 53.4   48.7 75.0 59.0   57.1 17.4 26.7   55.8
[Figure 6: The impact of the number of seed words per category. (x-axis: # seed words per category, 0-10; y-axis: accuracy, 40%-60%.)]

Table 4: The performances of POS tagging of English tokens by different parsers with/without the proposed method. "WP" and "WoP" represent the accuracy with and without the normalization method respectively.

Methods               WoP     WP
Stanford Parser 3.2   66.4%   84.7%
Berkeley Parser 1.7   67.9%   83.9%
FudanNLP 1.57         54.0%   79.6%

5.4 Word Categorization Results
For the methods INIT+LP/WS and LP, 4 seed words are used for each category. The total number of out-of-vocabulary words is 387, which accounts for about 32.3% of all English tokens.

From the results, we can observe that the proposed method (INIT+LP/WS) achieves the best performance in four of the five categories. The accuracy of the proposed method is also significantly better than that of the other methods. Comparing the results with and without the proposed initialization method (INIT+LP/WS vs. LP), we can observe that the initialization contributes a lot: the relative accuracy improvement is more than 148%. This demonstrates that the proposed method can achieve better performance with a small number of seed words.

From Table 3, we also observe that none of the methods achieves satisfactory performance for the person name and loanword categories. We analyzed the errors in these categories and found that many of them are caused by acronyms. Person names may be represented by acronyms; however, some of these acronyms are also used as organization names. Since the number of webpages about an organization is usually larger than the number of webpages describing a person, most of them are classified into the organization category. This is also one of the main reasons why all the methods achieve low precision in the organization name category. These acronyms cannot be correctly classified without context information.

To show the impact of the number of seed words on performance, we evaluate the accuracy of the word categorization method with different numbers of seed words. The results are shown in Figure 6. From the figure, we can observe that the accuracy is significantly improved with seed words compared to the method without seed words. The accuracy improves continuously as the number of seed words increases. However, when the number of seed words per category is more than 4, the performance increases slowly. We think that the lack of context information is one of the main reasons.

5.5 Applications
To show the effectiveness of the proposed method as a preprocessing step for other NLP tasks, we evaluate our method with three Chinese parsers. In this work, we only focus on the relative changes caused by English tokens. We randomly select 100 microblogs from the whole evaluation set. The in-vocabulary words are translated into Chinese. The words which cannot be translated are replaced by their category names.
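This replacement scheme can be sketched as follows; the helper names and the fallback for tokens with no predicted category are our own assumptions, not the paper's implementation.

def normalize_for_parsing(words, translation, category):
    # Replace each English token either by its top-1 Chinese translation
    # (in-vocabulary) or by the name of its predicted category (OOV),
    # so that a monolingual Chinese parser can process the sentence.
    normalized = []
    for w in words:
        if not (w.isascii() and w.isalpha()):
            normalized.append(w)                    # Chinese tokens pass through
        elif w in translation:
            normalized.append(translation[w])       # e.g. "book" -> "预订"
        else:
            normalized.append(category.get(w, w))   # e.g. "iPhone" -> "产品名" (hypothetical tag)
    return normalized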
Table 4 shows the accuracy of POS tagging of English tokens with and without the proposed normalization method. From the results, we can observe that all three parsers benefit a lot from the proposed normalization method in processing mixed texts. The relative improvements are significant. Since features extracted from words play important roles in existing methods, the lack of word information may highly impact their performances. For methods which label all English tokens with the same tag [24], the proposed method can bring even more benefit.

6. CONCLUSIONS
In this paper, we focus on the task of normalizing Chinese-English mixed texts. We first analyzed the phenomenon of mixed usage in Chinese. Then, we proposed to use word translation and categorization to achieve the task. For word translation, we use a noisy-channel approach with a neural network language model to translate in-vocabulary words. A novel user aware neural network language model is introduced to capture the useful historical information of users. For categorizing words, a graph-based unsupervised method is proposed. We also introduce a novel initialization technique to improve its effectiveness. Experimental results show that the top-5 word translation accuracy of the proposed method achieves 86.2%. For word categorization, the proposed method also achieves significant improvement over the baseline methods. The relative improvement over the original label
propagation method is more than 148%. We also incorporate the proposed method as a preprocessing step for three different parsers. All of them benefit a lot from the proposed method.

7. ACKNOWLEDGEMENT
The authors wish to thank the anonymous reviewers for their helpful comments. This work was partially funded by the National Natural Science Foundation of China (61003092, 61073069), the National Major Science and Technology Special Project of China (2014ZX03006005), the Shanghai Municipal Science and Technology Commission (No. 12511504502), Key Projects in the National Science & Technology Pillar Program (2012BAH18B01), and the "Chen Guang" project supported by the Shanghai Municipal Education Commission and the Shanghai Education Development Foundation (11CG05).

8. REFERENCES
[1] A. Aw, M. Zhang, J. Xiao, and J. Su. A phrase-based statistical model for SMS text normalization. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 33-40, Sydney, Australia, July 2006. Association for Computational Linguistics.
[2] R. Beaufort, S. Roekhaut, L.-A. Cougnon, and C. Fairon. A hybrid rule/model-based finite-state framework for normalizing SMS messages. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, ACL '10, pages 770-779, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[3] J.-S. Chang and W.-L. Teng. Mining atomic Chinese abbreviations with a probabilistic single character recovery model. Language Resources and Evaluation, 40(3-4):367-374, 2006.
[4] R. Collobert and J. Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, ICML '08, pages 160-167, New York, NY, USA, 2008. ACM.
[5] D. Das and S. Petrov. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 600-609, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
[6] L. Derczynski, A. Ritter, S. Clark, and K. Bontcheva. Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Proceedings of the International Conference on Recent Advances in Natural Language Processing. Association for Computational Linguistics, 2013.
[7] D. Freitag. Machine learning for information extraction in informal domains. Machine Learning, 39(2-3):169-202, 2000.
[8] B. Han and T. Baldwin. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 368-378, Portland, Oregon, USA, June 2011. Association for Computational Linguistics.
[9] B. Han, P. Cook, and T. Baldwin. Automatically constructing a normalisation dictionary for microblogs. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 421-432, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[10] B. Han, P. Cook, and T. Baldwin. Lexical normalization for social media text. ACM Transactions on Intelligent Systems and Technology, 4(1):5:1-5:27, Feb. 2013.
[11] E. H. Huang, R. Socher, C. D. Manning, and A. Y. Ng. Improving word representations via global context and multiple word prototypes. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 873-882, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[12] M. Johnson and A. E. Ural. Reranking the Berkeley and Brown parsers. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 665-668, Los Angeles, California, June 2010. Association for Computational Linguistics.
[13] C. Kobus, F. Yvon, and G. Damnati. Normalizing SMS: are two metaphors better than one? In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING '08, pages 441-448, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.
[14] C. Li and Y. Liu. Improving text normalization using character-blocks based models and system combination. In Proceedings of COLING 2012, pages 1587-1602, Mumbai, India, December 2012. The COLING 2012 Organizing Committee.
[15] X. Li, Y.-Y. Wang, and A. Acero. Learning query intent from regularized click graphs. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '08, pages 339-346, New York, NY, USA, 2008. ACM.
[16] Z. Li and D. Yarowsky. Mining and modeling relations between formal and informal Chinese phrases from web corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '08, pages 1031-1040, Stroudsburg, PA, USA, 2008. Association for Computational Linguistics.
[17] Z. Li and D. Yarowsky. Unsupervised translation induction for Chinese abbreviations using monolingual corpora. In Proceedings of ACL-08: HLT, pages 425-433, Columbus, Ohio, June 2008. Association for Computational Linguistics.
[18] F. Liu, F. Weng, and X. Jiang. A broad-coverage normalization system for social media language. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, ACL '12, pages 1035-1044, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[19] E. Minkov, R. C. Wang, and W. W. Cohen. Extracting personal names from email: applying named entity recognition to informal text. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 443-450, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.
[20] T. Mullen and R. Malouf. A preliminary investigation into sentiment analysis of informal political discourse. In Proceedings of the AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs, 2006.
[21] Z.-Y. Niu, D.-H. Ji, and C. L. Tan. Word sense disambiguation using label propagation based semi-supervised learning. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL '05, pages 395-402, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.
[22] F. J. Och and H. Ney. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51, Mar. 2003.
[23] N. Okazaki, M. Ishizuka, and J. Tsujii. A discriminative approach to Japanese abbreviation extraction. In Proceedings of the Third International Joint Conference on Natural Language Processing (IJCNLP 2008), pages 889-894, 2008.
[24] X. Qian, Q. Zhang, X. Huang, and L. Wu. 2D trie for fast parsing. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING '10, pages 904-912, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[25] X. Qiu, Q. Zhang, and X. Huang. FudanNLP: A toolkit for Chinese natural language processing. In Proceedings of ACL, 2013.
[26] G. Richard. A global perspective on bilingualism and bilingual education. In Georgetown University Round Table on Languages and Linguistics 1999: Language in Our Time: Bilingual Education and Official English, Ebonics and Standard English, Immigration and the Unz Initiative, page 332, 2001.
[27] A. Ritter, S. Clark, Mausam, and O. Etzioni. Named entity recognition in tweets: an experimental study. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1524-1534, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[28] R. Socher, J. Bauer, C. D. Manning, and A. Y. Ng. Parsing with compositional vector grammars. In Proceedings of ACL 2013, June 2013.
[29] A. Tamura, T. Watanabe, and E. Sumita. Bilingual lexicon extraction from comparable corpora using label propagation. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 24-36, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[30] M. Thelwall, K. Buckley, G. Paltoglou, D. Cai, and A. Kappas. Sentiment strength detection in short informal text. Journal of the American Society for Information Science and Technology, 61(12):2544-2558, 2010.
[31] P. D. Turney and M. L. Littman. Unsupervised learning of semantic orientation from a hundred-billion-word corpus. Technical Report ERB-1094 (NRC #44929), National Research Council of Canada, 2002.
[32] L. Velikovich, S. Blair-Goldensohn, K. Hannan, and R. McDonald. The viability of web-derived polarity lexicons. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, pages 777-785, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.
[33] P. Wang and H. T. Ng. A beam-search decoder for normalization of social media text with application to machine translation. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 471-481, Atlanta, Georgia, June 2013. Association for Computational Linguistics.
[34] L.-X. Xie, Y.-B. Zheng, Z.-Y. Liu, M.-S. Sun, and C.-H. Wang. Extracting Chinese abbreviation-definition pairs from anchor texts. In Machine Learning and Cybernetics (ICMLC), volume 4, pages 1485-1491, 2011.
[35] D. Yang, Y.-C. Pan, and S. Furui. Vocabulary expansion through automatic abbreviation generation for Chinese voice search. Computer Speech & Language, 26(5):321-335, 2012.
[36] J. Zhao, X. Qiu, S. Zhang, F. Ji, and X. Huang. Part-of-speech tagging for Chinese-English mixed texts with dynamic features. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL '12, pages 1379-1388, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
[37] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.