
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 512–519, Prague, Czech Republic, June 2007. © 2007 Association for Computational Linguistics
Randomised Language Modelling for Statistical Machine Translation
David Talbot and Miles Osborne
School of Informatics, University of Edinburgh
2 Buccleuch Place, Edinburgh, EH8 9LW, UK
d.r.talbot@sms.ed.ac.uk, miles@inf.ed.ac.uk
Abstract
A Bloom filter (BF) is a randomised data structure for set membership queries. Its space requirements are significantly below lossless information-theoretic lower bounds but it produces false positives with some quantifiable probability. Here we explore the use of BFs for language modelling in statistical machine translation.

We show how a BF containing n-grams can enable us to use much larger corpora and higher-order models, complementing a conventional n-gram LM within an SMT system. We also consider (i) how to include approximate frequency information efficiently within a BF and (ii) how to reduce the error rate of these models by first checking for lower-order sub-sequences in candidate n-grams. Our solutions in both cases retain the one-sided error guarantees of the BF while taking advantage of the Zipf-like distribution of word frequencies to reduce the space requirements.
1 Introduction
Language modelling (LM) is a crucial component in statistical machine translation (SMT). Standard n-gram language models assign probabilities to translation hypotheses in the target language, typically as smoothed trigram models, e.g. (Chiang, 2005). Although it is well known that higher-order LMs and models trained on additional monolingual corpora can yield better translation performance, the challenges in deploying large LMs are not trivial. Increasing the order of an n-gram model can result in an exponential increase in the number of parameters; for corpora such as the English Gigaword corpus, for instance, there are 300 million distinct trigrams and over 1.2 billion 5-grams. Since a LM may be queried millions of times per sentence, it should ideally reside locally in memory to avoid time-consuming remote or disk-based look-ups.
Against this background, we consider a radically different approach to language modelling: instead of explicitly storing all distinct n-grams, we store a randomised representation. In particular, we show that the Bloom filter (Bloom (1970); BF), a simple space-efficient randomised data structure for representing sets, may be used to represent statistics from larger corpora and for higher-order n-grams to complement a conventional smoothed trigram model within an SMT decoder.[1]
The space requirements of a Bloom filter are quite spectacular, falling significantly below information-theoretic error-free lower bounds, while query times are constant. This efficiency, however, comes at the price of false positives: the filter may erroneously report that an item not in the set is a member. False negatives, on the other hand, will never occur: the error is said to be one-sided.
In this paper, we show that a Bloom filter can be used effectively for language modelling within an SMT decoder and present the log-frequency Bloom filter, an extension of the standard Boolean BF that takes advantage of the Zipf-like distribution of corpus statistics to allow frequency information to be associated with n-grams in the filter in a space-efficient manner. We then propose a mechanism, sub-sequence filtering, for reducing the error rates of these models by using the fact that an n-gram's frequency is bounded from above by the frequency of its least frequent sub-sequence.

[1] For extensions of the framework presented here to stand-alone smoothed Bloom filter language models, we refer the reader to a companion paper (Talbot and Osborne, 2007).
We present machine translation experiments using these models to represent information regarding higher-order n-grams and additional larger monolingual corpora in combination with conventional smoothed trigram models. We also run experiments with these models in isolation to highlight the impact of different order n-grams on the translation process. Finally, we provide some empirical analysis of the effectiveness of both the log-frequency Bloom filter and sub-sequence filtering.
2 The Bloom filter
In this section, we give a brief overview of the Bloom filter (BF); refer to Broder and Mitzenmacher (2005) for a more detailed presentation. A BF represents a set S = {x_1, x_2, ..., x_n} with n elements drawn from a universe U of size N. The structure is attractive when N ≫ n. The only significant storage used by a BF consists of a bit array of size m. This is initially set to hold zeroes. To train the filter we hash each item in the set k times using distinct hash functions h_1, h_2, ..., h_k. Each function is assumed to be independent of the others and to map items in the universe to the range 1 to m uniformly at random. The k bits indexed by the hash values for each item are set to 1; the item is then discarded. Once a bit has been set to 1 it remains set for the lifetime of the filter. Distinct items may not be hashed to k distinct locations in the filter; we ignore collisions. Bits in the filter can, therefore, be shared by distinct items, allowing significant space savings but introducing a non-zero probability of false positives at test time. There is no way of directly retrieving or enumerating the items stored in a BF.
At test time we wish to discover whether a given item was a member of the original set. The filter is queried by hashing the test item using the same k hash functions. If all bits referenced by the k hash values are 1 then we assume that the item was a member; if any of them are 0 then we know it was not. True members are always correctly identified, but a false positive will occur if all k corresponding bits were set by other items during training and the item was not a member of the training set. This is known as a one-sided error.
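To make the training and querying procedures concrete, here is a minimal Python sketch of a Boolean Bloom filter. It illustrates the data structure described above and is not the authors' implementation; in particular, deriving the k indices by salting a SHA-256 hash is our own assumption.

import hashlib

class BloomFilter:
    """Minimal Boolean Bloom filter (illustrative sketch)."""

    def __init__(self, m, k):
        self.m = m                           # size of the bit array
        self.k = k                           # number of hash functions
        self.bits = bytearray((m + 7) // 8)  # bit array, initially all zeroes

    def _indexes(self, item):
        # Derive k pseudo-independent indices in [0, m) by salting a cryptographic hash.
        # The salting scheme is an assumption made for this sketch.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Training: set the k bits indexed by the hash values, then discard the item.
        for idx in self._indexes(item):
            self.bits[idx // 8] |= 1 << (idx % 8)

    def __contains__(self, item):
        # Query: report membership only if all k bits are set; any 0 proves non-membership.
        return all((self.bits[idx // 8] >> (idx % 8)) & 1 for idx in self._indexes(item))

bf = BloomFilter(m=1 << 20, k=3)
bf.add("the cat sat")           # store a trigram as a single string key
print("the cat sat" in bf)      # True: members are never missed
print("the dog sat" in bf)      # almost certainly False; True only on a false positive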
The probability of a false positive, f, is clearly the probability that none of k randomly selected bits in the filter are still 0 after training. Letting p be the proportion of bits that are still zero after these n elements have been inserted, this gives

    f = (1 - p)^k.

As n items have been entered in the filter by hashing each k times, the probability that a bit is still zero is

    p' = (1 - 1/m)^{kn} \approx e^{-kn/m},

which is the expected value of p. Hence the false positive rate can be approximated as

    f = (1 - p)^k \approx (1 - p')^k \approx (1 - e^{-kn/m})^k.

By taking the derivative we find that the number of functions k* that minimizes f is

    k^* = \ln 2 \cdot \frac{m}{n},

which leads to the intuitive result that exactly half the bits in the filter will be set to 1 when the optimal number of hash functions is chosen.
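As a quick sanity check on these formulas, the following snippet (with illustrative numbers of our own choosing, not values from the paper) computes the optimal number of hash functions and the resulting false positive rate for a filter that allocates 8 bits per stored item.

import math

def false_positive_rate(m, n, k):
    # Approximate BF false positive rate: f = (1 - e^{-kn/m})^k.
    return (1.0 - math.exp(-k * n / m)) ** k

def optimal_k(m, n):
    # Number of hash functions minimising f: k* = ln(2) * m / n.
    return math.log(2) * m / n

n = 1_000_000   # items to store, e.g. distinct n-grams (illustrative)
m = 8 * n       # 8 bits per item
print(optimal_k(m, n))                  # ~5.55
print(false_positive_rate(m, n, k=6))   # ~0.02, i.e. roughly a 2% false positive rate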
The fundamental difference between a Bloom filter's space requirements and those of any lossless representation of a set is that the former do not depend on the size of the (exponential) universe N from which the set is drawn. A lossless representation scheme (for example, a hash map, trie, etc.) must depend on N since it assigns a distinct representation to each possible set drawn from the universe.
3 Language modelling with Bloom filters

In our experiments we make use of both standard (i.e. Boolean) BFs containing n-gram types drawn from a training corpus and a novel BF scheme, the log-frequency Bloom filter, that allows frequency information to be associated efficiently with items stored in the filter.
Algorithm 1 Training frequency BF
Input: S_train, h_1, ..., h_k and BF = ∅
Output: BF
for all x ∈ S_train do
    c(x) ← frequency of n-gram x in S_train
    qc(x) ← quantisation of c(x) (Eq. 1)
    for j = 1 to qc(x) do
        for i = 1 to k do
            h_i(x) ← hash of event {x, j} under h_i
            BF[h_i(x)] ← 1
        end for
    end for
end for
return BF
3.1 Log-frequency Bloom filter

The efficiency of our scheme for storing n-gram statistics within a BF relies on the Zipf-like distribution of n-gram frequencies in natural language corpora: most events occur an extremely small number of times, while a small number are very frequent.
We quantise raw frequencies, c(x), using a logarithmic codebook as follows:

    qc(x) = 1 + \lfloor \log_b c(x) \rfloor.    (1)

The precision of this codebook decays exponentially with the raw counts and the scale is determined by the base of the logarithm b; we examine the effect of this parameter in experiments below.
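As a small illustration of Eq. 1 (with example counts of our own choosing), the helper below shows how a base-2 codebook collapses a wide range of raw counts into a handful of quantised levels; it is reused in the sketches that follow.

import math

def qc(count, base=2):
    # Eq. 1: qc(x) = 1 + floor(log_b c(x)).
    return 1 + int(math.floor(math.log(count, base)))

for c in (1, 2, 3, 10, 100, 1_000_000):
    print(c, "->", qc(c, base=2))
# 1 -> 1, 2 -> 2, 3 -> 2, 10 -> 4, 100 -> 7, 1000000 -> 20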
Given the quantised count qc(x) for an n-gram x, the filter is trained by entering composite events, consisting of the n-gram appended by an integer counter j that is incremented from 1 to qc(x), into the filter. To retrieve the quantised count for an n-gram, it is first appended with a count of 1 and hashed under the k functions; if this tests positive, the count is incremented and the process repeated. The procedure terminates as soon as any of the k hash functions hits a 0 and the previous count is reported. The one-sided error of the BF and the training scheme ensure that the actual quantised count cannot be larger than this value. As the counts are quantised logarithmically, the counter will be incremented only a small number of times. The training and testing routines are given here as Algorithms 1 and 2 respectively.
Algorithm 2 Test frequency BF
Input: x, MAXQCOUNT, h_1, ..., h_k and BF
Output: Upper bound on qc(x) for x ∈ S_train
for j = 1 to MAXQCOUNT do
    for i = 1 to k do
        h_i(x) ← hash of event {x, j} under h_i
        if BF[h_i(x)] = 0 then
            return j - 1
        end if
    end for
end for

Errors for the log-frequency BF scheme are one-sided: frequencies will never be underestimated. The probability of overestimating an item's frequency decays exponentially with the size of the overestimation error d (i.e. as f^d for d > 0), since each erroneous increment corresponds to a single false positive and d such independent events must occur together.
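The following Python sketch mirrors Algorithms 1 and 2, reusing the BloomFilter class from Section 2 and the qc() quantiser above. The composite-event encoding as the string "ngram|j" and the MAXQCOUNT default are our own illustrative choices, not details prescribed by the paper.

def train_frequency_bf(bf, ngram_counts, base=2):
    # Algorithm 1: for each n-gram x, enter the composite events (x, 1), ..., (x, qc(x)).
    for ngram, count in ngram_counts.items():
        for j in range(1, qc(count, base) + 1):
            bf.add(f"{ngram}|{j}")   # the string encoding of the composite event is an assumption

def test_frequency_bf(bf, ngram, max_qcount=16):
    # Algorithm 2: increment j until some hash function misses; report the previous count.
    # Because errors are one-sided, the returned value can only overestimate the true qc(x).
    for j in range(1, max_qcount + 1):
        if f"{ngram}|{j}" not in bf:
            return j - 1
    return max_qcount

bf = BloomFilter(m=1 << 16, k=3)
train_frequency_bf(bf, {"the cat sat": 10, "cat sat on": 2})
print(test_frequency_bf(bf, "the cat sat"))     # 4, since qc(10) = 4 with base 2, barring false positives
print(test_frequency_bf(bf, "unseen trigram"))  # 0 unless a false positive occurs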
3.2 Sub-sequence filtering

The error analysis in Section 2 focused on the false positive rate of a BF; if we deploy a BF within an SMT decoder, however, the actual error rate will also depend on the a priori membership probability of items presented to it. The error rate Err is

    Err = \Pr(x \notin S_{train} \mid \text{Decoder}) \cdot f.

This implies that, unlike a conventional lossless data structure, the model's accuracy depends on other components in the system and how it is queried.
We take advantage of the monotonicity of the n-gram event space to place upper bounds on the frequency of an n-gram prior to testing for it in the filter, and potentially truncate the outer loop in Algorithm 2 when we know that the test could only return positive in error.

Specifically, if we have stored lower-order n-grams in the filter, we can infer that an n-gram cannot be present if any of its sub-sequences tests negative. Since our scheme for storing frequencies can never underestimate an item's frequency, this relation generalises to frequencies: an n-gram's frequency cannot be greater than the frequency of its least frequent sub-sequence as reported by the filter,

    c(w_1, ..., w_n) \le \min\{ c(w_1, ..., w_{n-1}), c(w_2, ..., w_n) \}.

We use this to reduce the effective error rate of the BF-LMs in the experiments below.
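A minimal sketch of sub-sequence filtering on top of the helpers above, under the assumption that the two (n-1)-gram sub-sequences (and their quantised counts) were stored in the same filter; the function name and the space-joined string keys are illustrative choices of our own.

def filtered_frequency(bf, words, max_qcount=16):
    # Upper-bound qc for the n-gram `words`, filtering with its two (n-1)-gram sub-sequences.
    ngram = " ".join(words)
    if len(words) == 1:
        return test_frequency_bf(bf, ngram, max_qcount)
    left = test_frequency_bf(bf, " ".join(words[:-1]), max_qcount)
    right = test_frequency_bf(bf, " ".join(words[1:]), max_qcount)
    bound = min(left, right)
    if bound == 0:
        return 0   # a sub-sequence the filter has never seen proves the n-gram is absent
    # c(w_1..w_n) <= min of the sub-sequence counts, the filter never underestimates, and the
    # quantisation is monotone, so Algorithm 2's outer loop can be truncated at `bound`.
    return test_frequency_bf(bf, ngram, max_qcount=bound)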
3.3 Bloom filter language model tests

A standard BF can implement a Boolean language model test: have we seen some fragment of language before? This does not use any frequency information. The Boolean BF-LM is a standard BF containing all n-grams of a certain length in the training corpus, S_train. It implements the following binary feature function in a log-linear decoder,

    \phi_{bool}(x) \equiv \delta(x \in S_{train}).
Separate Boolean BF-LMs can be included for different orders n and assigned distinct log-linear weights that are learned as part of a minimum error rate training procedure (see Section 4).

The log-frequency BF-LM implements a multinomial feature function in the decoder that returns the value associated with an n-gram by Algorithm 2,

    \phi_{logfreq}(x) \equiv qc(x),

the quantised count of x in S_train as reported by the filter.
Sub-sequence filtering can be performed by using the minimum value returned by lower-order models as an upper bound on the higher-order models.

By boosting the score of hypotheses containing n-grams observed in the training corpus while remaining agnostic for unseen n-grams (with the exception of errors), these feature functions have more in common with maximum entropy models than conventionally smoothed n-gram models.
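As a rough sketch of how such feature functions could be computed for a translation hypothesis (the per-hypothesis summation and function names are our own framing for illustration; this is not the Moses decoder's actual feature API):

def ngrams(words, order):
    # All order-n n-grams of a hypothesis, as space-joined string keys.
    return [" ".join(words[i:i + order]) for i in range(len(words) - order + 1)]

def phi_bool(bf, words, order):
    # Boolean BF-LM feature: count of hypothesis n-grams found in a Boolean BF
    # whose keys are the raw n-grams.
    return sum(1 for ng in ngrams(words, order) if ng in bf)

def phi_logfreq(bf, words, order):
    # Log-frequency BF-LM feature: sum of quantised counts reported by Algorithm 2,
    # assuming a frequency BF trained with train_frequency_bf.
    return sum(test_frequency_bf(bf, ng) for ng in ngrams(words, order))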
4 Experiments
We conducted a range of experiments to explore the effectiveness and the error-space trade-off of Bloom filters for language modelling in SMT. The space-efficiency of these models also allows us to investigate the impact of using much larger corpora and higher-order n-grams on translation quality. While our main experiments use the Bloom filter models in conjunction with a conventional smoothed trigram model, we also present experiments with these models in isolation to highlight the impact of different order n-grams on the translation process. Finally, we present some empirical analysis of both the log-frequency Bloom filter and the sub-sequence filtering technique, which may be of independent interest.
Model       EP-KN-3   EP-KN-4   AFP-KN-3
Memory      64M       99M       1.3G
gzip size   21M       31M       481M
1-gms       62K       62K       871K
2-gms       1.3M      1.3M      16M
3-gms       1.1M      1.0M      31M
4-gms       N/A       1.1M      N/A

Table 1: Baseline and Comparison Models
4.1 Experimental set-up

All of our experiments use publicly available resources. We use the French-English section of the Europarl (EP) corpus for parallel data and language modelling (Koehn, 2003) and the English Gigaword Corpus (LDC2003T05; GW) for additional language modelling.

Decoding is carried out using the Moses decoder (Koehn and Hoang, 2007). We hold out 500 test sentences and 250 development sentences from the parallel text for evaluation purposes. The feature functions in our models are optimised using minimum error rate training and evaluation is performed using the BLEU score.
4.2 Baseline and comparison models

Our baseline LM and other comparison models are conventional n-gram models smoothed using modified Kneser-Ney and built using the SRILM Toolkit (Stolcke, 2002); as is standard practice, these models drop entries for n-grams of size 3 and above when the corresponding discounted count is less than 1. The baseline language model, EP-KN-3, is a trigram model trained on the English portion of the parallel corpus. For additional comparisons we also trained a smoothed 4-gram model on this Europarl data (EP-KN-4) and a trigram model on the Agence France Press section of the Gigaword Corpus (AFP-KN-3).

Table 1 shows the amount of memory these models take up on disk, their size when compressed using the gzip utility, and the number of distinct n-grams of each order. We give the gzip-compressed size as an optimistic lower bound on the size of any lossless representation of each model.[2]

[2] Note, in particular, that gzip-compressed files do not support direct random access as required by our application.
Corpus    Europarl   Gigaword
1-gms     61K        281K
2-gms     1.3M       5.4M
3-gms     4.7M       275M
4-gms     9.0M       599M
5-gms     10.3M      842M
6-gms     10.7M      957M

Table 2: Number of distinct n-grams
4.3 Bloom filter-based models

To create Bloom filter LMs we gathered n-gram counts from both the Europarl (EP) corpus and the whole of the Gigaword Corpus (GW). Table 2 shows the numbers of distinct n-grams in these corpora. Note that we use no pruning for these models and that the number of distinct n-grams is of the same order as that of the recently released Google Ngrams dataset (LDC2006T13). In our experiments we create a range of models referred to by the corpus used (EP or GW), the order of the n-gram(s) entered into the filter (1 to 10), whether the model is Boolean (Bool-BF) or provides frequency information (Freq-BF), whether or not sub-sequence filtering was used (FTR), and whether it was used in conjunction with the baseline trigram (+EP-KN-3).
4.4 Machine translation experiments

Our first set of experiments examines the relationship between memory allocated to the BF and BLEU score. We present results using the Boolean BF-LM in isolation and then both the Boolean and log-frequency BF-LMs to add 4-grams to our baseline 3-gram model. Our second set of experiments adds 3-grams and 5-grams from the Gigaword Corpus to our baseline. Here we contrast the Boolean BF-LM with the log-frequency BF-LM with different quantisation bases (2 = fine-grained and 5 = coarse-grained). We then evaluate the sub-sequence filtering approach to reducing the actual error rate of these models by adding both 3-grams and 4-grams from the Gigaword Corpus to the baseline. Since the BF-LMs easily allow us to deploy very high-order n-gram models, we use them to evaluate the impact of different order n-grams on the translation process, presenting results using the Boolean and log-frequency BF-LMs in isolation for n-grams of order 1 to 10.
Model       EP-KN-3   EP-KN-4   AFP-KN-3
BLEU        28.51     29.24     29.17
Memory      64M       99M       1.3G
gzip size   21M       31M       481M

Table 3: Baseline and Comparison Models
4.5 Analysis of BF extensions

We analyse our log-frequency BF scheme in terms of the additional memory it requires and the error rate compared to a non-redundant scheme. The non-redundant scheme involves entering just the exact quantised count for each n-gram and then searching over the range of possible counts at test time, starting with the count with maximum a priori probability (i.e. 1) and incrementing until a count is found or the whole codebook has been searched (here the size is 16).
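For comparison with the sketches above, a hedged sketch of this non-redundant scheme, again using our illustrative "ngram|j" key encoding:

def train_nonredundant_bf(bf, ngram_counts, base=2):
    # Enter only the exact quantised count for each n-gram.
    for ngram, count in ngram_counts.items():
        bf.add(f"{ngram}|{qc(count, base)}")

def test_nonredundant_bf(bf, ngram, codebook_size=16):
    # Search counts in order of prior probability (1 upwards) until one tests positive.
    # A single false positive along the way can yield an arbitrarily wrong count.
    for j in range(1, codebook_size + 1):
        if f"{ngram}|{j}" in bf:
            return j
    return 0   # nothing found anywhere in the codebook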
We also analyse the sub-sequence filtering scheme directly by creating a BF with only 3-grams and a BF containing both 2-grams and 3-grams, and comparing their actual error rates when presented with 3-grams that are all known to be negatives.
5 Results
5.1 Machine translation experiments
Table 3 shows the results of the baseline (EP-KN-3) and other conventional n-gram models trained on larger corpora (AFP-KN-3) and using higher-order dependencies (EP-KN-4). The larger models improve somewhat on the baseline performance.

Figure 1 shows the relationship between space allocated to the BF models and BLEU score (left) and false positive rate (right), respectively. These experiments do not include the baseline model. We can see a clear correlation between memory / false positive rate and translation performance.
Adding 4-grams in the form of a Boolean BF or a log-frequency BF (see Figure 2) improves on the 3-gram baseline with little additional memory (around 4MB) while performing on a par with or above the Europarl 4-gram model with around 10MB; this suggests that a lossy representation of the unpruned set of 4-grams contains more useful information than a lossless representation of the pruned set.[3]

[3] An unpruned modified Kneser-Ney 4-gram model on the Europarl data scores slightly higher (29.69) while taking up 489MB (132MB gzipped).
[Figure 1: Space/Error vs. BLEU Score. BLEU score and false positive rate plotted against memory in MB for the Europarl Boolean BF 4-gram model (Bool-BF-EP-4) used alone.]
[Figure 2: Adding 4-grams with Bloom filters. BLEU score vs. memory in MB for EP-Bool-BF-4 + EP-KN-3 and EP-Freq-BF-4 + EP-KN-3, compared against the EP-KN-4 model (99M / 31M gzipped) and the EP-KN-3 baseline (64M / 21M gzipped).]
As the false positive rate exceeds 0.20, the performance is severely degraded. Adding 3-grams drawn from the whole of the Gigaword corpus rather than simply the Agence France Press section results in slightly improved performance with significantly less memory than the AFP-KN-3 model (see Figure 3).
[Figure 3: Adding GW 3-grams with Bloom filters. BLEU score vs. memory in GB for GW-Bool-BF-3 + EP-KN-3 and GW-Freq-BF-3 + EP-KN-3, compared with AFP-KN-3 + EP-KN-3.]

Figure 4 shows the results of adding 5-grams drawn from the Gigaword corpus to the baseline. It also contrasts the Boolean BF and the log-frequency BF, suggesting in this case that the log-frequency BF can provide useful information when the quantisation base is relatively fine-grained (base 2). The Boolean BF and the base-5 (coarse-grained quantisation) log-frequency BF perform approximately the same. The base-2 quantisation performs worse for smaller amounts of memory, possibly due to the larger set of events it is required to store.

[Figure 4: Comparison of different quantisation rates. BLEU score vs. memory in GB for GW-Bool-BF-5 + EP-KN-3 and GW-Freq-BF-5 (base 2 and base 5) + EP-KN-3, compared with AFP-KN-3 + EP-KN-3.]
Figure 5 shows sub-sequence filtering resulting in a small increase in performance when false positive rates are high (i.e. less memory is allocated). We believe this to be the result of an increased a priori membership probability for n-grams presented to the filter under the sub-sequence filtering scheme. Figure 6 shows that for this task the most useful n-gram sizes are between 3 and 6.

[Figure 5: Effect of sub-sequence filtering. BLEU score vs. memory in MB for GW-Bool-BF-3-4-FTR + EP-KN-3 and GW-Bool-BF-3-4 + EP-KN-3.]

[Figure 6: Impact of n-grams of different sizes. BLEU score for EP-Bool-BF and EP-Freq-BF models used in isolation, for n-grams of order 1 to 10.]

5.2 Analysis of BF extensions

Figure 8 compares the memory requirements of the log-frequency BF (base 2) and the Boolean BF for various order n-gram sets from the Gigaword Corpus with the same underlying false positive rate (0.125). The additional space required by our scheme for storing frequency information is less than a factor of 2 compared to the standard BF.
Figure 7 shows the number and size of frequency estimation errors made by our log-frequency BF scheme and a non-redundant scheme that stores only the exact quantised count. We presented 500K negatives to the filter and recorded the frequency of overestimation errors of each size. As shown in Section 3.1, the probability of overestimating an item's frequency under the log-frequency BF scheme decays exponentially in the size of this overestimation error. Although the non-redundant scheme requires fewer items to be stored in the filter and, therefore, has a lower underlying false positive rate (0.076 versus 0.159), in practice it incurs a much higher error rate (0.717) with many large errors.

[Figure 7: Frequency estimation errors. Frequency (in thousands) of overestimation errors of each size (1 to 16) on 500K negatives, for the log-frequency BF (Bloom error = 0.159) and the non-redundant scheme (Bloom error = 0.076).]

[Figure 8: Comparison of memory requirements. Memory (MB) needed for a 0.125 false positive rate by Bool-BF and Freq-BF (log base-2 quantisation) for Gigaword n-gram sets of order 1 to 7.]
Figure 9 shows the impact of sub-sequence filtering on the actual error rate. Although the false positive rate for the BF containing 2-grams in addition to 3-grams (filtered) is higher than the false positive rate of the unfiltered BF containing only 3-grams, the actual error rate of the former is lower for models with less memory. By testing for 2-grams prior to querying for the 3-grams, we can avoid performing some queries that might otherwise have incurred errors, using the fact that a 3-gram cannot be present if one of its constituent 2-grams is absent.

[Figure 9: Error rate with sub-sequence filtering. Error rate vs. memory in MB, showing the filtered false positive rate, the unfiltered false positive rate (equal to its actual error rate), and the filtered actual error rate.]
6 Related Work
We are not the first people to consider building very large-scale LMs: Kumar et al. used a four-gram LM for re-ranking (Kumar et al., 2005), and in unpublished work Google used substantially larger n-grams in their SMT system. Deploying such LMs requires either a cluster of machines (and the overheads of remote procedure calls), per-sentence filtering (which, again, is slow) and/or the use of some other lossy compression (Goodman and Gao, 2000). Our approach can complement all these techniques.
Bloom filters have been widely used in database applications for reducing communications overheads and were recently applied to encode word frequencies in information retrieval (Linari and Weikum, 2006) using a method that resembles the non-redundant scheme described above. Extensions of the BF to associate frequencies with items in the set have been proposed, e.g. (Cormode and Muthukrishnan, 2005); while these schemes are more general than ours, they incur greater space overheads for the distributions that we consider here.
7 Conclusions
We have shown that Bloom filters can form the basis for space-efficient language modelling in SMT. By extending the standard BF structure to encode corpus frequency information and developing a strategy for reducing the error rates of these models by sub-sequence filtering, our models enable higher-order n-grams and larger monolingual corpora to be used more easily for language modelling in SMT.

In a companion paper (Talbot and Osborne, 2007) we have proposed a framework for deriving conventional smoothed n-gram models from the log-frequency BF scheme, allowing us to do away entirely with the standard n-gram model in an SMT system. We hope the present work will help establish the Bloom filter as a practical alternative to conventional associative data structures used in computational linguistics. The framework presented here shows that, with some consideration for its workings, the randomised nature of the Bloom filter need not be a significant impediment to its use in applications.
References
B. Bloom. 1970. Space/time tradeoffs in hash coding with allowable errors. CACM, 13:422–426.

A. Broder and M. Mitzenmacher. 2005. Network applications of Bloom filters: A survey. Internet Mathematics, 1(4):485–509.

David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 263–270, Ann Arbor, Michigan.

G. Cormode and S. Muthukrishnan. 2005. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75.

J. Goodman and J. Gao. 2000. Language model size reduction by pruning and clustering. In ICSLP'00, Beijing, China.

Philipp Koehn and Hieu Hoang. 2007. Factored translation models. In Proceedings of the 2007 Conference on Empirical Methods in Natural Language Processing (EMNLP-CoNLL).

P. Koehn. 2003. Europarl: A multilingual corpus for evaluation of machine translation. Draft. Available at: http://people.csail.mit.edu/koehn/publications/europarl.ps.

S. Kumar, Y. Deng, and W. Byrne. 2005. Johns Hopkins University - Cambridge University Chinese-English and Arabic-English 2005 NIST MT evaluation systems. In Proceedings of the 2005 NIST MT Workshop, June.

Alessandro Linari and Gerhard Weikum. 2006. Efficient peer-to-peer semantic overlay networks based on statistical language models. In Proceedings of the International Workshop on IR in Peer-to-Peer Networks, pages 9–16, Arlington.

Andreas Stolcke. 2002. SRILM - an extensible language modeling toolkit. In Proceedings of the International Conference on Spoken Language Processing.

David Talbot and Miles Osborne. 2007. Smoothed Bloom filter language models: Tera-scale LMs on the cheap. In Proceedings of the 2007 Conference on Empirical Methods in Natural Language Processing (EMNLP-CoNLL), June.