Abstract
Web crawlers account for more than a third of total web traffic and threaten the security, privacy, and veracity of web applications and their users. Businesses in finance, ticketing, and publishing, as well as websites with rich and unique content, are the most affected. To address this problem, we present a novel web robot detection approach that exploits the content of a website, based on the assumption that human users are interested in specific topics, whereas web robots crawl the web randomly. Our approach extends the typical log-based feature representation of user sessions with a novel set of features that capture the semantics of the content of the requested resources. In addition, we contribute a new real-world dataset, which we make publicly available, to help alleviate the scarcity of open data in this field. Empirical results on this dataset validate our assumption and show that our approach outperforms state-of-the-art methods for web robot detection.
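As an illustration of the underlying assumption, the following minimal sketch shows one possible way to turn "topical focus" into a session-level feature. It is not the feature set used in the paper: the function names (`cosine`, `topical_coherence`, `session_features`), the feature name `topical_coherence`, and the toy topic vectors are all hypothetical, and the sketch assumes that each requested resource has already been mapped to a topic or embedding vector by some content model.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def topical_coherence(page_vectors):
    """Mean pairwise cosine similarity of the pages requested in one session.

    Hypothetical feature: values near 1 suggest a topically focused session,
    values near 0 suggest requests scattered across unrelated topics.
    """
    pairs = list(combinations(page_vectors, 2))
    if not pairs:
        # Coherence is undefined for fewer than two pages; 0.0 is an
        # arbitrary placeholder chosen for this sketch.
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def session_features(log_features, page_vectors):
    """Extend a dict of log-based features with the semantic feature."""
    return {**log_features, "topical_coherence": topical_coherence(page_vectors)}


# Toy topic vectors (assumed, not taken from the paper's dataset).
focused = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.85, 0.1, 0.05]]
scattered = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

features = session_features({"total_requests": 3}, focused)
print(features["topical_coherence"] > topical_coherence(scattered))
```

Under the stated assumption, a session like `focused` would score higher than one like `scattered`, and a classifier could use this value alongside the usual log-based features.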
Notes
The country codes are obtained from the IP address using the GeoLite2 database (https://dev.maxmind.com/geoip/geoip2/geolite2/).
https://browscap.org/ - Version 6000031
https://github.com/atmire/COUNTER-Robots - Accessed 28-March-2019
https://www.projectcounter.org - Accessed 15-July-2019
https://bit.ly/2XSDjzI - Accessed 28-March-2019
https://matomo.org - Accessed 15-July-2019
https://www.readcube.com/papers/ - Accessed 27-March-2019
https://hc.apache.org - Accessed 27-March-2019
Acknowledgements
The authors would like to thank Theodoros Theodoropoulos and Aikaterini Nasta from Aristotle University’s Central Library for their help in providing the data.
Additional information
This research is co-financed by Greece and the European Union (European Social Fund - ESF) through the Operational Programme Human Resources Development, Education and Lifelong Learning in the context of the project Strengthening Human Resources Research Potential via Doctorate Research (MIS-5000432), implemented by the State Scholarships Foundation (IKY).
Cite this article
Lagopoulos, A., Tsoumakas, G. Content-aware web robot detection. Appl Intell 50, 4017–4028 (2020). https://doi.org/10.1007/s10489-020-01754-9
DOI: https://doi.org/10.1007/s10489-020-01754-9