Abstract
Languages are dynamic: new words appear over time, as do new variations of existing ones. In addition, the vocabulary available to a distributed text representation model is limited. Methods that can handle unknown words (i.e., out-of-vocabulary, or OOV, words) are therefore essential to the quality of natural language processing systems. Although several techniques can handle OOV words, most rely on a single source of information (e.g., word structure or context) or on straightforward strategies unable to capture semantic or morphological information. In this study, we present FastContext, a method for handling OOV words that improves the subword-based embedding returned by the state-of-the-art FastText with a context-based embedding computed by a deep learning model. We evaluated its performance on word similarity, named entity recognition, and part-of-speech tagging tasks. FastContext outperformed FastText in scenarios where the context is the most relevant source for inferring the meaning of an OOV word. Moreover, the results obtained by the proposed approach were better than those of state-of-the-art OOV handling techniques such as HiCE, Comick, and DistilBERT.
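As an illustrative sketch of the two-source idea (not the authors' exact architecture, which the abstract does not specify), the Python snippet below blends FastText's subword-based vector for an OOV word with a context-based vector. The mean-of-context encoder and the `gate` mixing weight are assumptions for illustration only; in the paper the context embedding comes from a deep learning model.

```python
# Illustrative sketch only: FastContext's context encoder is a deep network;
# here it is replaced by a simple average of the in-vocabulary context
# embeddings, and `gate` is a hypothetical mixing weight.
import numpy as np
from gensim.models import FastText

# Train a tiny FastText model so OOV words still get subword-based vectors.
sentences = [["the", "quick", "brown", "fox"], ["a", "lazy", "dog", "sleeps"]]
model = FastText(sentences, vector_size=50, min_count=1, min_n=3, max_n=5)

def context_embedding(context_words, wv):
    """Average the embeddings of the known words surrounding the OOV token."""
    known = [wv[w] for w in context_words if w in wv.key_to_index]
    return np.mean(known, axis=0) if known else np.zeros(wv.vector_size)

def fastcontext_like(oov_word, context_words, wv, gate=0.5):
    """Blend the subword-based vector with the context-based vector."""
    subword_vec = wv[oov_word]  # FastText handles OOV via char n-grams
    context_vec = context_embedding(context_words, wv)
    return gate * subword_vec + (1.0 - gate) * context_vec

vec = fastcontext_like("foxxy", ["the", "quick", "brown"], model.wv)
print(vec.shape)  # (50,)
```

The sketch only conveys the design of combining word structure with context; in FastContext, how the two embeddings are combined is learned rather than fixed by a constant weight.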
Notes
- 1. We remove any word shorter than three characters.
- 2. PyTorch GitHub. Available at https://bit.ly/2B7LS3U, accessed on 2022/11/09.
- 3. Transformers. Available at https://huggingface.co/transformers, accessed on 2022/11/09.
- 4. HiCE. Available at https://github.com/acbull/HiCE, accessed on 2022/11/09.
- 5. Keras. Available at https://keras.io/, accessed on 2022/11/09.
- 6. TensorFlow. Available at https://www.tensorflow.org/, accessed on 2022/11/09.
References
Balasuriya, D., Ringland, N., Nothman, J., Murphy, T., Curran, J.R.: Named entity recognition in Wikipedia. In: Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web), pp. 10–18. Association for Computational Linguistics, Suntec, Singapore, August 2009. https://www.aclweb.org/anthology/W09-3302
Baziotis, C., Pelekis, N., Doulkeridis, C.: DataStories at SemEval-2017 task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 747–754. Association for Computational Linguistics, Vancouver, Canada, August 2017. https://doi.org/10.18653/v1/S17-2126, https://www.aclweb.org/anthology/S17-2126
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)
Bos, J., Basile, V., Evang, K., Venhuizen, N., Bjerva, J.: The Groningen Meaning Bank. In: Ide, N., Pustejovsky, J. (eds.) Handbook of Linguistic Annotation, vol. 2, pp. 463–496. Springer (2017). https://doi.org/10.1007/978-94-024-0881-2_18
Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: ELECTRA: pre-training text encoders as discriminators rather than generators. In: ICLR, pp. 1–18 (2020)
Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
Derczynski, L., Nichols, E., van Erp, M., Limsopatham, N.: Results of the WNUT2017 shared task on novel and emerging entity recognition. In: Proceedings of the 3rd Workshop on Noisy User-generated Text, pp. 140–147. Association for Computational Linguistics, Copenhagen, Denmark, September 2017. https://doi.org/10.18653/v1/W17-4418
Flor, M., Fried, M., Rozovskaya, A.: A benchmark corpus of English misspellings and a minimally-supervised model for spelling correction. In: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 76–86. Association for Computational Linguistics, Florence, Italy, August 2019. https://doi.org/10.18653/v1/W19-4407
Garneau, N., Leboeuf, J.S., Lamontagne, L.: Predicting and interpreting embeddings for out of vocabulary words in downstream tasks. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 331–333. Association for Computational Linguistics, Brussels, Belgium, November 2018. https://doi.org/10.18653/v1/W18-5439
Goldberg, Y., Hirst, G.: Neural Network Methods in Natural Language Processing. Morgan & Claypool Publishers (2017)
Hu, Z., Chen, T., Chang, K.W., Sun, Y.: Few-shot representation learning for out-of-vocabulary words. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 2019
Khodak, M., Saunshi, N., Liang, Y., Ma, T., Stewart, B., Arora, S.: A la carte embedding: cheap but effective induction of semantic feature vectors. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12–22. Association for Computational Linguistics, Melbourne, Australia, July 2018. https://doi.org/10.18653/v1/P18-1002
Kim, J.D., Ohta, T., Tsuruoka, Y., Tateisi, Y., Collier, N.: Introduction to the bio-entity recognition task at JNLPBA. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications, pp. 70–75. JNLPBA 2004. Association for Computational Linguistics, Stroudsburg, PA, USA (2004). http://dl.acm.org/citation.cfm?id=1567594.1567610
Kong, L., Schneider, N., Swayamdipta, S., Bhatia, A., Dyer, C., Smith, N.A.: A dependency parser for tweets. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1001–1012. Association for Computational Linguistics, Doha, Qatar, October 2014. https://doi.org/10.3115/v1/D14-1108, https://www.aclweb.org/anthology/D14-1108
Lazaridou, A., Marelli, M., Zamparelli, R., Baroni, M.: Compositionally derived representations of morphologically complex words in distributional semantics. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1517–1526. Association for Computational Linguistics, Sofia, Bulgaria, August 2013. https://www.aclweb.org/anthology/P13-1149
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. CoRR arXiv:1907.11692 (2019)
Lochter, J.V., Pires, P.R., Bossolani, C., Yamakami, A., Almeida, T.A.: Evaluating the impact of corpora used to train distributed text representation models for noisy and short texts. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, July 2018
Lochter, J.V., Silva, R.M., Almeida, T.A.: Deep learning models for representing out-of-vocabulary words. In: Proceedings of the 9th Brazilian Conference on Intelligent Systems (BRACIS 2020), pp. 418–434. Springer International Publishing, Rio Grande, RS, Brazil, October 2020. https://doi.org/10.1007/978-3-030-61377-8_29
Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19(2), 313–330 (1993)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS 2013), pp. 3111–3119. Curran Associates Inc., Lake Tahoe, Nevada, USA (2013)
Ohta, T., Pyysalo, S., Tsujii, J., Ananiadou, S.: Open-domain anatomical entity mention detection. In: Proceedings of the Workshop on Detecting Structure in Scholarly Discourse, pp. 27–36. ACL 2012. Association for Computational Linguistics, Jeju Island, Korea, July 2012
Pinter, Y., Guthrie, R., Eisenstein, J.: Mimicking word embeddings using subword RNNs. CoRR arXiv:1707.06961 (2017)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog (2019)
Ritter, A., Clark, S., Mausam, Etzioni, O.: Named entity recognition in tweets: an experimental study. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 1524–1534. Association for Computational Linguistics, Edinburgh, Scotland, UK, July 2011
Rushton, E.: A simple spell checker built from word vectors (2018). https://blog.usejournal.com/a-simple-spell-checker-built-from-word-vectors-9f28452b6f26
Alvarado, J.C.S., Verspoor, K., Baldwin, T.: Domain adaption of named entity recognition to support credit risk assessment. In: Proceedings of the Australasian Language Technology Association Workshop 2015, pp. 84–90, Parramatta, Australia, December 2015
Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. CoRR arXiv:1910.01108 (2019)
Sundermeyer, M., Schlüter, R., Ney, H.: LSTM neural networks for language modeling. In: INTERSPEECH, pp. 1–4 (2012)
Tateisi, Y., Tsujii, J.I.: Part-of-speech annotation of biology research abstracts. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004). European Language Resources Association (ELRA), Lisbon, Portugal, May 2004
Sang, E.F.T.K., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pp. 142–147. CONLL 2003. Association for Computational Linguistics, USA (2003). https://doi.org/10.3115/1119176.1119195
Vaswani, A., et al.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc. (2017). http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
Yang, J., Leskovec, J.: Patterns of temporal variation in online media. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 177–186. WSDM 2011. ACM, New York, NY, USA (2011)
Yang, X., Macdonald, C., Ounis, I.: Using word embeddings in twitter election classification. Inf. Retr. 21(2–3), 183–207 (2018). https://doi.org/10.1007/s10791-017-9319-5
Acknowledgments
We gratefully acknowledge the support provided by the São Paulo Research Foundation (FAPESP; grant #2018/02146-6).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Silva, R.M., Lochter, J.V., Almeida, T.A., Yamakami, A. (2022). FastContext: Handling Out-of-Vocabulary Words Using the Word Structure and Context. In: Xavier-Junior, J.C., Rios, R.A. (eds) Intelligent Systems. BRACIS 2022. Lecture Notes in Computer Science, vol. 13654. Springer, Cham. https://doi.org/10.1007/978-3-031-21689-3_38
DOI: https://doi.org/10.1007/978-3-031-21689-3_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21688-6
Online ISBN: 978-3-031-21689-3