Scalable Modified Kneser-Ney Language Model Estimation
jointly iterating through N streams, one for each length of n-gram. The relevant pseudo probability u(w_n | w_1^{n-1}) and backoff b(w_1^{n-1}) appear in the input records (Equation 1).

[Figure 4: Peak virtual memory usage. Axes: RAM (GB) vs. tokens (millions); systems compared: SRI, SRI compact, IRST, and this work.]

3.5 Joining

The last task is to unite b(w_1^n) computed in §3.3 with p(w_n | w_1^{n-1}) computed in §3.4 for storage in the model. We note that interpolation (Equation 2) used the different backoff b(w_1^{n-1}), and so b(w_1^n) is not immediately available. However, the backoff values were saved in suffix order (§3.3) and interpolation produces probabilities in suffix order. During the same streaming pass as interpolation, we merge the two streams.5 Suffix order is also convenient because the popular reverse trie data structure can be built in the same pass.6

[Figure: CPU time (hours) vs. tokens (millions); systems compared: SRI, SRI compact, IRST, and this work. Caption not recoverable.]
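The merge in §3.5 is an ordinary sorted-merge join: both streams arrive in the same suffix order, so one linear pass pairs each probability with its backoff. The sketch below assumes an illustrative record layout of (key, log-value) pairs whose keys compare in suffix order, and a default log backoff of 0.0 for n-grams that never appear as a context; these names and conventions are our own, not the toolkit's actual format.

```python
def join_streams(probs, backoffs, default_backoff=0.0):
    """Pair each (ngram, prob) record with its backoff in one streaming pass.

    Both iterables must be sorted by the same key (here, tuples that compare
    in suffix order). Records and the 0.0 default are illustrative only.
    """
    backoffs = iter(backoffs)
    pending = next(backoffs, None)
    for ngram, prob in probs:
        # Advance the backoff stream until it catches up with this n-gram.
        while pending is not None and pending[0] < ngram:
            pending = next(backoffs, None)
        if pending is not None and pending[0] == ngram:
            yield ngram, prob, pending[1]
        else:
            # No entry: this n-gram never extends, so it carries no backoff.
            yield ngram, prob, default_backoff
```

For example, joining probabilities for ("a",), ("b",), and ("b", "a") against backoffs for ("a",) and ("b", "a") yields the backoff where one exists and the default for ("b",), all without buffering either stream in memory.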
4 Sorting