Abstract
In this paper we study the combined use of four NLP toolkits (Stanford CoreNLP, GATE, OpenNLP and Twitter NLP tools) on social media posts. Previous studies have compared the performance of these tools on both news and social media corpora. We go further by examining how differently these toolkits predict named entities, in terms of precision and recall for three entity types, and how they can complement each other to achieve a combined performance superior to that of each individual toolkit. Experiments on two publicly available datasets, from the WNUT-2015 and #MSM2013 workshops, show that an ensemble of toolkits can improve the recognition of specific entity types: by up to 10.62% for the type Person, 1.97% for the type Location and 1.31% for the type Organization, depending on the dataset and the voting criteria. Our results also show improvements of 3.76% and 1.69% on the two datasets, respectively, in the average performance over the three entity types.
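To make the idea of a voting ensemble concrete, the following is a minimal sketch of how per-token entity labels from several toolkits might be combined. It is not the authors' method: the abstract does not specify the voting criteria, so the `vote` function, the `min_votes` threshold, the label names and the sample predictions are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def vote(token_labels, min_votes=2, outside="O"):
    """Combine per-token labels from several toolkits by simple voting.

    token_labels maps a toolkit name to its list of labels, one per token.
    A token receives an entity label only if at least `min_votes` toolkits
    agree on it (an assumed criterion, not taken from the paper).
    """
    combined = []
    for labels in zip(*token_labels.values()):
        # Count only entity labels; "O" means the toolkit found no entity.
        counts = Counter(label for label in labels if label != outside)
        if not counts:
            combined.append(outside)
            continue
        label, votes = counts.most_common(1)[0]
        combined.append(label if votes >= min_votes else outside)
    return combined


# Hypothetical outputs for a four-token post; not real toolkit output.
predictions = {
    "corenlp": ["PERSON", "O", "O", "LOCATION"],
    "gate": ["PERSON", "O", "O", "O"],
    "opennlp": ["O", "O", "O", "LOCATION"],
    "twitter_nlp": ["PERSON", "O", "ORGANIZATION", "LOCATION"],
}

print(vote(predictions))               # stricter: needs two agreeing toolkits
print(vote(predictions, min_votes=1))  # looser: any single toolkit suffices
```

Raising `min_votes` would be expected to favour precision, while lowering it favours recall, which is one way a choice of voting criteria could affect each entity type differently.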
Acknowledgments
This work is supported by the ERDF (European Regional Development Fund) through the COMPETE Programme (operational programme for competitiveness) and by National Funds through the FCT (Portuguese Foundation for Science and Technology) within project “Reminds/UTAP-ICDT/EEI-CTP/0022/2014”.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Batista, F., Figueira, Á. (2017). The Complementary Nature of Different NLP Toolkits for Named Entity Recognition in Social Media. In: Oliveira, E., Gama, J., Vale, Z., Lopes Cardoso, H. (eds) Progress in Artificial Intelligence. EPIA 2017. Lecture Notes in Computer Science, vol 10423. Springer, Cham. https://doi.org/10.1007/978-3-319-65340-2_65
DOI: https://doi.org/10.1007/978-3-319-65340-2_65
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-65339-6
Online ISBN: 978-3-319-65340-2
eBook Packages: Computer Science (R0)