Abstract
Languages are dynamic: new words appear over time, as do new variations of existing ones. In addition, the vocabulary available to a distributed text representation model is limited. Methods that can handle unknown words (i.e., out-of-vocabulary, or OOV, words) are therefore essential to the quality of natural language processing systems. Although several techniques can handle OOV words, most rely on a single source of information (e.g., word structure or context) or on straightforward strategies unable to capture semantic or morphological information. In this study, we present FastContext, a method for handling OOV words that improves the subword-based embedding returned by the state-of-the-art FastText with a context-based embedding computed by a deep learning model. We evaluated its performance on word similarity, named entity recognition, and part-of-speech tagging tasks. FastContext outperformed FastText in scenarios where the context is the most relevant source for inferring the meaning of an OOV word. Moreover, the results obtained by the proposed approach were better than those of state-of-the-art OOV handling techniques such as HiCE, Comick, and DistilBERT.
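As an illustrative sketch of the two-source idea (not the authors' exact architecture, which the abstract does not specify), the Python snippet below blends FastText's subword-based vector for an OOV word with a context-based vector. The mean-of-context encoder and the `gate` mixing weight are assumptions for illustration only; in the paper the context embedding comes from a deep learning model.

```python
# Illustrative sketch only: FastContext's context encoder is a deep network;
# here it is replaced by a simple average of the in-vocabulary context
# embeddings, and `gate` is a hypothetical mixing weight.
import numpy as np
from gensim.models import FastText

# Train a tiny FastText model so OOV words still get subword-based vectors.
sentences = [["the", "quick", "brown", "fox"], ["a", "lazy", "dog", "sleeps"]]
model = FastText(sentences, vector_size=50, min_count=1, min_n=3, max_n=5)

def context_embedding(context_words, wv):
    """Average the embeddings of the known words surrounding the OOV token."""
    known = [wv[w] for w in context_words if w in wv.key_to_index]
    return np.mean(known, axis=0) if known else np.zeros(wv.vector_size)

def fastcontext_like(oov_word, context_words, wv, gate=0.5):
    """Blend the subword-based vector with the context-based vector."""
    subword_vec = wv[oov_word]  # FastText handles OOV via char n-grams
    context_vec = context_embedding(context_words, wv)
    return gate * subword_vec + (1.0 - gate) * context_vec

vec = fastcontext_like("foxxy", ["the", "quick", "brown"], model.wv)
print(vec.shape)  # (50,)
```

The sketch only conveys the design of combining word structure with context; in FastContext, how the two embeddings are combined is learned rather than fixed by a constant weight.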
Notes
- 1. We remove any word shorter than three characters.
- 2. PyTorch GitHub. Available at https://bit.ly/2B7LS3U, accessed on 2022/11/09.
- 3. Transformers. Available at https://huggingface.co/transformers, accessed on 2022/11/09.
- 4. HiCE. Available at https://github.com/acbull/HiCE, accessed on 2022/11/09.
- 5. Keras. Available at https://keras.io/, accessed on 2022/11/09.
- 6. TensorFlow. Available at https://www.tensorflow.org/, accessed on 2022/11/09.
References
Balasuriya, D., Ringland, N., Nothman, J., Murphy, T., Curran, J.R.: Named entity recognition in Wikipedia. In: Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web), pp. 10–18. Association for Computational Linguistics, Suntec, Singapore, August 2009. https://www.aclweb.org/anthology/W09-3302
Baziotis, C., Pelekis, N., Doulkeridis, C.: DataStories at SemEval-2017 task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 747–754. Association for Computational Linguistics, Vancouver, Canada, August 2017. https://doi.org/10.18653/v1/S17-2126, https://www.aclweb.org/anthology/S17-2126
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
Bos, J., Basile, V., Evang, K., Venhuizen, N., Bjerva, J.: The Groningen Meaning Bank. In: Ide, N., Pustejovsky, J. (eds.) Handbook of Linguistic Annotation, vol. 2, pp. 463–496. Springer (2017). https://doi.org/10.1007/978-94-024-0881-2_18
Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: ELECTRA: pre-training text encoders as discriminators rather than generators. In: ICLR, pp. 1–18 (2020)
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
Derczynski, L., Nichols, E., van Erp, M., Limsopatham, N.: Results of the WNUT2017 shared task on novel and emerging entity recognition. In: Proceedings of the 3rd Workshop on Noisy User-generated Text, pp. 140–147. Association for Computational Linguistics, Copenhagen, Denmark, September 2017. https://doi.org/10.18653/v1/W17-4418
Flor, M., Fried, M., Rozovskaya, A.: A benchmark corpus of English misspellings and a minimally-supervised model for spelling correction. In: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 76–86. Association for Computational Linguistics, Florence, Italy, August 2019. https://doi.org/10.18653/v1/W19-4407
Garneau, N., Leboeuf, J.S., Lamontagne, L.: Predicting and interpreting embeddings for out of vocabulary words in downstream tasks. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 331–333. Association for Computational Linguistics, Brussels, Belgium, November 2018. https://doi.org/10.18653/v1/W18-5439
Goldberg, Y., Hirst, G.: Neural Network Methods in Natural Language Processing. Morgan & Claypool Publishers (2017)
Hu, Z., Chen, T., Chang, K.W., Sun, Y.: Few-shot representation learning for out-of-vocabulary words. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 2019
Khodak, M., Saunshi, N., Liang, Y., Ma, T., Stewart, B., Arora, S.: A la carte embedding: cheap but effective induction of semantic feature vectors. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12–22. Association for Computational Linguistics, Melbourne, Australia, July 2018. https://doi.org/10.18653/v1/P18-1002
Kim, J.D., Ohta, T., Tsuruoka, Y., Tateisi, Y., Collier, N.: Introduction to the bio-entity recognition task at JNLPBA. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications, pp. 70–75. JNLPBA 2004. Association for Computational Linguistics, Stroudsburg, PA, USA (2004). http://dl.acm.org/citation.cfm?id=1567594.1567610
Kong, L., Schneider, N., Swayamdipta, S., Bhatia, A., Dyer, C., Smith, N.A.: A dependency parser for tweets. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1001–1012. Association for Computational Linguistics, Doha, Qatar, October 2014. https://doi.org/10.3115/v1/D14-1108, https://www.aclweb.org/anthology/D14-1108
Lazaridou, A., Marelli, M., Zamparelli, R., Baroni, M.: Compositionally derived representations of morphologically complex words in distributional semantics. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1517–1526. Association for Computational Linguistics, Sofia, Bulgaria, August 2013. https://www.aclweb.org/anthology/P13-1149
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. CoRR arXiv:1907.11692 (2019)
Lochter, J.V., Pires, P.R., Bossolani, C., Yamakami, A., Almeida, T.A.: Evaluating the impact of corpora used to train distributed text representation models for noisy and short texts. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, July 2018
Lochter, J.V., Silva, R.M., Almeida, T.A.: Deep learning models for representing out-of-vocabulary words. In: Proceedings of the 9th Brazilian Conference on Intelligent Systems (BRACIS 2020), pp. 418–434. Springer International Publishing, Rio Grande, RS, Brazil, October 2020. https://doi.org/10.1007/978-3-030-61377-8_29
Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19(2), 313–330 (1993)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS 2013), pp. 3111–3119. Curran Associates Inc., Lake Tahoe, Nevada, USA (2013)
Ohta, T., Pyysalo, S., Tsujii, J., Ananiadou, S.: Open-domain anatomical entity mention detection. In: Proceedings of the Workshop on Detecting Structure in Scholarly Discourse, pp. 27–36. ACL 2012. Association for Computational Linguistics, Jeju Island, Korea, July 2012
Pinter, Y., Guthrie, R., Eisenstein, J.: Mimicking word embeddings using subword RNNs. CoRR arXiv:1707.06961 (2017)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog (2019)
Ritter, A., Clark, S., Mausam, Etzioni, O.: Named entity recognition in tweets: an experimental study. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 1524–1534. Association for Computational Linguistics, Edinburgh, Scotland, UK, July 2011
Rushton, E.: A simple spell checker built from word vectors (2018). https://blog.usejournal.com/a-simple-spell-checker-built-from-word-vectors-9f28452b6f26
Alvarado, J.C.S., Verspoor, K., Baldwin, T.: Domain adaption of named entity recognition to support credit risk assessment. In: Proceedings of the Australasian Language Technology Association Workshop 2015, pp. 84–90, Parramatta, Australia, December 2015
Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR arXiv:1910.01108 (2019)
Sundermeyer, M., Schlüter, R., Ney, H.: LSTM neural networks for language modeling. In: INTERSPEECH, pp. 1–4 (2012)
Tateisi, Y., Tsujii, J.I.: Part-of-speech annotation of biology research abstracts. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004). European Language Resources Association (ELRA), Lisbon, Portugal, May 2004
Sang, E.F.T.K., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pp. 142–147. CONLL 2003. Association for Computational Linguistics, USA (2003). https://doi.org/10.3115/1119176.1119195
Vaswani, A., et al.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc. (2017). http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
Yang, J., Leskovec, J.: Patterns of temporal variation in online media. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 177–186. WSDM 2011. ACM, New York, NY, USA (2011)
Yang, X., Macdonald, C., Ounis, I.: Using word embeddings in twitter election classification. Inf. Retr. 21(2–3), 183–207 (2018). https://doi.org/10.1007/s10791-017-9319-5
Acknowledgments
We gratefully acknowledge the support provided by the São Paulo Research Foundation (FAPESP; grant #2018/02146-6).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Silva, R.M., Lochter, J.V., Almeida, T.A., Yamakami, A. (2022). FastContext: Handling Out-of-Vocabulary Words Using the Word Structure and Context. In: Xavier-Junior, J.C., Rios, R.A. (eds) Intelligent Systems. BRACIS 2022. Lecture Notes in Computer Science, vol. 13654. Springer, Cham. https://doi.org/10.1007/978-3-031-21689-3_38
DOI: https://doi.org/10.1007/978-3-031-21689-3_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21688-6
Online ISBN: 978-3-031-21689-3