FastContext: Handling Out-of-Vocabulary Words Using the Word Structure and Context

  • Conference paper

Intelligent Systems (BRACIS 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13654)


Abstract

Languages are dynamic: new words and variations of existing ones appear over time. Moreover, the vocabulary used by distributed text representation models is limited. Therefore, methods that can handle unknown (out-of-vocabulary, OOV) words are essential to the quality of natural language processing systems. Although some techniques can handle OOV words, most rely on a single source of information (e.g., word structure or context) or on straightforward strategies unable to capture semantic or morphological information. In this study, we present FastContext, a method for handling OOV words that refines the subword-based embedding returned by the state-of-the-art FastText with a context-based embedding computed by a deep learning model. We evaluated its performance on word similarity, named entity recognition, and part-of-speech tagging tasks. FastContext performed better than FastText in scenarios where context is the most relevant source for inferring the meaning of an OOV word. Moreover, the proposed approach outperformed state-of-the-art OOV handling techniques such as HiCE, Comick, and DistilBERT.
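The abstract describes combining two sources of evidence for an OOV word: its internal structure (character n-grams, as in FastText) and the words surrounding it. The toy sketch below illustrates that general idea only; it is not the paper's architecture. The embedding tables, the averaging context model, and the fixed interpolation weight `alpha` are all illustrative assumptions (FastContext learns the combination with a deep model).

```python
import numpy as np

# Toy illustration (NOT the paper's architecture): combine a
# subword-based embedding with a context-based one for an OOV word.
rng = np.random.default_rng(0)
DIM = 8

# Hypothetical pretrained tables: character n-gram vectors (FastText-style)
# and word vectors for in-vocabulary context words.
ngram_vecs = {g: rng.normal(size=DIM)
              for g in ["<fa", "fas", "ast", "stc", "tco", "con",
                        "ont", "nte", "tex", "ext", "xt>"]}
word_vecs = {w: rng.normal(size=DIM)
             for w in ["the", "model", "handles", "unknown", "words"]}

def subword_embedding(word, n=3):
    """FastText-style: average the vectors of the word's character n-grams."""
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    vecs = [ngram_vecs[g] for g in grams if g in ngram_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

def context_embedding(context):
    """Context-based: average the vectors of the surrounding words
    (a crude stand-in for the paper's learned deep model)."""
    vecs = [word_vecs[w] for w in context if w in word_vecs]
    return np.mean(vecs, axis=0) if vecs else np.zeros(DIM)

def oov_embedding(word, context, alpha=0.5):
    # Blend the two sources. FastContext learns this combination;
    # here it is a fixed interpolation for illustration only.
    return alpha * subword_embedding(word) + (1 - alpha) * context_embedding(context)

vec = oov_embedding("fastcontext", ["the", "model", "handles", "unknown", "words"])
print(vec.shape)  # prints (8,)
```

When the word's spelling is uninformative (e.g., a novel slang term), the context term dominates; when the context is sparse, the subword term still yields a reasonable vector, which is the complementarity the paper exploits.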


Notes

  1. We remove any word shorter than three characters.

  2. PyTorch GitHub. Available at https://bit.ly/2B7LS3U. Accessed on 2022/11/09 18:41:15.

  3. Transformers. Available at https://huggingface.co/transformers. Accessed on 2022/11/09 18:41:15.

  4. HiCE. Available at https://github.com/acbull/HiCE. Accessed on 2022/11/09 18:41:15.

  5. Keras. Available at https://keras.io/. Accessed on 2022/11/09 18:41:15.

  6. TensorFlow. Available at https://www.tensorflow.org/. Accessed on 2022/11/09 18:41:15.

References

  1. Balasuriya, D., Ringland, N., Nothman, J., Murphy, T., Curran, J.R.: Named entity recognition in Wikipedia. In: Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web), pp. 10–18. Association for Computational Linguistics, Suntec, Singapore, August 2009. https://www.aclweb.org/anthology/W09-3302

  2. Baziotis, C., Pelekis, N., Doulkeridis, C.: DataStories at SemEval-2017 task 4: Deep LSTM with attention for message-level and topic-based sentiment analysis. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 747–754. Association for Computational Linguistics, Vancouver, Canada, August 2017. https://doi.org/10.18653/v1/S17-2126, https://www.aclweb.org/anthology/S17-2126

  3. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)

  4. Bos, J., Basile, V., Evang, K., Venhuizen, N., Bjerva, J.: The Groningen Meaning Bank. In: Ide, N., Pustejovsky, J. (eds.) Handbook of Linguistic Annotation, vol. 2, pp. 463–496. Springer (2017). https://doi.org/10.1007/978-94-024-0881-2_18

  5. Clark, K., Luong, M.T., Le, Q.V., Manning, C.D.: ELECTRA: pre-training text encoders as discriminators rather than generators. In: ICLR, pp. 1–18 (2020)

  6. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)

  7. Derczynski, L., Nichols, E., van Erp, M., Limsopatham, N.: Results of the WNUT2017 shared task on novel and emerging entity recognition. In: Proceedings of the 3rd Workshop on Noisy User-generated Text, pp. 140–147. Association for Computational Linguistics, Copenhagen, Denmark, September 2017. https://doi.org/10.18653/v1/W17-4418

  8. Flor, M., Fried, M., Rozovskaya, A.: A benchmark corpus of English misspellings and a minimally-supervised model for spelling correction. In: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pp. 76–86. Association for Computational Linguistics, Florence, Italy, August 2019. https://doi.org/10.18653/v1/W19-4407

  9. Garneau, N., Leboeuf, J.S., Lamontagne, L.: Predicting and interpreting embeddings for out of vocabulary words in downstream tasks. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pp. 331–333. Association for Computational Linguistics, Brussels, Belgium, November 2018. https://doi.org/10.18653/v1/W18-5439

  10. Goldberg, Y., Hirst, G.: Neural Network Methods in Natural Language Processing. Morgan & Claypool Publishers (2017)

  11. Hu, Z., Chen, T., Chang, K.W., Sun, Y.: Few-shot representation learning for out-of-vocabulary words. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 2019

  12. Khodak, M., Saunshi, N., Liang, Y., Ma, T., Stewart, B., Arora, S.: A la carte embedding: cheap but effective induction of semantic feature vectors. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12–22. Association for Computational Linguistics, Melbourne, Australia, July 2018. https://doi.org/10.18653/v1/P18-1002

  13. Kim, J.D., Ohta, T., Tsuruoka, Y., Tateisi, Y., Collier, N.: Introduction to the bio-entity recognition task at JNLPBA. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications, pp. 70–75. JNLPBA 2004. Association for Computational Linguistics, Stroudsburg, PA, USA (2004). http://dl.acm.org/citation.cfm?id=1567594.1567610

  14. Kong, L., Schneider, N., Swayamdipta, S., Bhatia, A., Dyer, C., Smith, N.A.: A dependency parser for tweets. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1001–1012. Association for Computational Linguistics, Doha, Qatar, October 2014. https://doi.org/10.3115/v1/D14-1108, https://www.aclweb.org/anthology/D14-1108

  15. Lazaridou, A., Marelli, M., Zamparelli, R., Baroni, M.: Compositionally derived representations of morphologically complex words in distributional semantics. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1517–1526. Association for Computational Linguistics, Sofia, Bulgaria, August 2013. https://www.aclweb.org/anthology/P13-1149

  16. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. CoRR arXiv:1907.11692 (2019)

  17. Lochter, J.V., Pires, P.R., Bossolani, C., Yamakami, A., Almeida, T.A.: Evaluating the impact of corpora used to train distributed text representation models for noisy and short texts. In: 2018 International Joint Conference on Neural Networks (IJCNN), pp. 1–8, July 2018

  18. Lochter, J.V., Silva, R.M., Almeida, T.A.: Deep learning models for representing out-of-vocabulary words. In: Proceedings of the 9th Brazilian Conference on Intelligent Systems (BRACIS 2020), pp. 418–434. Springer International Publishing, Rio Grande, RS, Brazil, October 2020. https://doi.org/10.1007/978-3-030-61377-8_29

  19. Marcus, M.P., Marcinkiewicz, M.A., Santorini, B.: Building a large annotated corpus of English: the Penn Treebank. Comput. Linguist. 19(2), 313–330 (1993)

  20. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems (NIPS 2013), pp. 3111–3119. Curran Associates Inc., Lake Tahoe, Nevada, USA (2013)

  21. Ohta, T., Pyysalo, S., Tsujii, J., Ananiadou, S.: Open-domain anatomical entity mention detection. In: Proceedings of the Workshop on Detecting Structure in Scholarly Discourse, pp. 27–36. ACL 2012. Association for Computational Linguistics, Jeju Island, Korea, July 2012

  22. Pinter, Y., Guthrie, R., Eisenstein, J.: Mimicking word embeddings using subword RNNs. CoRR arXiv:1707.06961 (2017)

  23. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog (2019)

  24. Ritter, A., Clark, S., Mausam, Etzioni, O.: Named entity recognition in tweets: an experimental study. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 1524–1534. Association for Computational Linguistics, Edinburgh, Scotland, UK, July 2011

  25. Rushton, E.: A simple spell checker built from word vectors (2018). https://blog.usejournal.com/a-simple-spell-checker-built-from-word-vectors-9f28452b6f26

  26. Alvarado, J.C.S., Verspoor, K., Baldwin, T.: Domain adaption of named entity recognition to support credit risk assessment. In: Proceedings of the Australasian Language Technology Association Workshop 2015, pp. 84–90, Parramatta, Australia, December 2015

  27. Sanh, V., Debut, L., Chaumond, J., Wolf, T.: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019)

  28. Sundermeyer, M., Schlüter, R., Ney, H.: LSTM neural networks for language modeling. In: INTERSPEECH, pp. 1–4 (2012)

  29. Tateisi, Y., Tsujii, J.I.: Part-of-speech annotation of biology research abstracts. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC 2004). European Language Resources Association (ELRA), Lisbon, Portugal, May 2004

  30. Sang, E.F.T.K., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pp. 142–147. CONLL 2003. Association for Computational Linguistics, USA (2003). https://doi.org/10.3115/1119176.1119195

  31. Vaswani, A., et al.: Attention is all you need. In: Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R. (eds.) Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc. (2017). http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf

  32. Yang, J., Leskovec, J.: Patterns of temporal variation in online media. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 177–186. WSDM 2011. ACM, New York, NY, USA (2011)

  33. Yang, X., Macdonald, C., Ounis, I.: Using word embeddings in Twitter election classification. Inf. Retr. 21(2–3), 183–207 (2018). https://doi.org/10.1007/s10791-017-9319-5

Acknowledgments

We gratefully acknowledge the support provided by the São Paulo Research Foundation (FAPESP; grant #2018/02146-6).

Author information

Corresponding author: Renato M. Silva.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Silva, R.M., Lochter, J.V., Almeida, T.A., Yamakami, A. (2022). FastContext: Handling Out-of-Vocabulary Words Using the Word Structure and Context. In: Xavier-Junior, J.C., Rios, R.A. (eds.) Intelligent Systems. BRACIS 2022. Lecture Notes in Computer Science, vol. 13654. Springer, Cham. https://doi.org/10.1007/978-3-031-21689-3_38

  • DOI: https://doi.org/10.1007/978-3-031-21689-3_38

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21688-6

  • Online ISBN: 978-3-031-21689-3

  • eBook Packages: Computer Science (R0)
