Scalable Modified Kneser-Ney Language Model Estimation
jointly iterating through N streams, one for each length of n-gram. The relevant pseudo probability u(w_n | w_1^{n-1}) and backoff b(w_1^{n-1}) appear in the input records (Equation 1).

[Figure 4: Peak virtual memory usage. Axes: RAM (GB) vs. tokens (millions); systems compared: SRI, SRI compact, IRST, and this work.]

3.5 Joining

The last task is to unite b(w_1^n) computed in §3.3 with p(w_n | w_1^{n-1}) computed in §3.4 for storage in the model. We note that interpolation (Equation 2) used the different backoff b(w_1^{n-1}), and so b(w_1^n) is not immediately available. However, the backoff values were saved in suffix order (§3.3) and interpolation produces probabilities in suffix order. During the same streaming pass as interpolation, we merge the two streams.5 Suffix order is also convenient because the popular reverse trie data structure can be built in the same pass.6

[Figure: CPU time (hours) vs. tokens (millions); systems compared: SRI, SRI compact, IRST, and this work. Caption not recoverable.]
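The merge in §3.5 is an ordinary sorted-merge join: both streams arrive in the same suffix order, so one linear pass pairs each probability with its backoff. The sketch below assumes an illustrative record layout of (key, log-value) pairs whose keys compare in suffix order, and a default log backoff of 0.0 for n-grams that never appear as a context; these names and conventions are our own, not the toolkit's actual format.

```python
def join_streams(probs, backoffs, default_backoff=0.0):
    """Pair each (ngram, prob) record with its backoff in one streaming pass.

    Both iterables must be sorted by the same key (here, tuples that compare
    in suffix order). Records and the 0.0 default are illustrative only.
    """
    backoffs = iter(backoffs)
    pending = next(backoffs, None)
    for ngram, prob in probs:
        # Advance the backoff stream until it catches up with this n-gram.
        while pending is not None and pending[0] < ngram:
            pending = next(backoffs, None)
        if pending is not None and pending[0] == ngram:
            yield ngram, prob, pending[1]
        else:
            # No entry: this n-gram never extends, so it carries no backoff.
            yield ngram, prob, default_backoff
```

For example, joining probabilities for ("a",), ("b",), and ("b", "a") against backoffs for ("a",) and ("b", "a") yields the backoff where one exists and the default for ("b",), all without buffering either stream in memory.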
4 Sorting