Abstract
In this paper, we tackle the task of document fraud detection, which we argue can be addressed with natural language processing (NLP) techniques. We treat it as a regression problem, using a pre-trained language model to represent the textual content and enriching that representation with domain-specific ontology-based entities and relations. We emulate an entity-based approach by comparing different types of input: raw text, extracted entities, and a triple-based reformulation of the document content. For our experimental setup, we use the only freely available dataset of forged receipts, and we provide an in-depth analysis of our results with regard to the efficiency of our methods. Our findings show correlations between the types of ontology relations (e.g., has_address, amounts_to), the types of entities (product, company, etc.), and the performance of a regression-based language model. These correlations could help to study how NLP methods can be transferred to boost the performance of existing fraud detection systems.
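As an illustration only, the three input types compared above (raw text, extracted entities, triple-based reformulation) could be sketched as follows. The toy receipt, the `Entity` class, the helper functions `as_entities` and `as_triples`, and the serialisation formats are all assumptions made for this example and are not the authors' implementation; only the relation names has_address and amounts_to and the entity types product and company are taken from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    kind: str  # entity type, e.g. "company" or "product"
    text: str  # surface form as it appears on the receipt


def as_entities(entities):
    """Entity-only input: each entity prefixed with its type tag (hypothetical format)."""
    return " ".join(f"[{e.kind}] {e.text}" for e in entities)


def as_triples(triples):
    """Triple-based input: one 'subject relation object' clause per triple (hypothetical format)."""
    return " . ".join(f"{s} {r} {o}" for s, r, o in triples)


# Invented toy receipt; not taken from the dataset.
raw_text = "ACME MARKET 12 Example Street baguette 1.10"

entities = [
    Entity("company", "ACME MARKET"),
    Entity("address", "12 Example Street"),
    Entity("product", "baguette"),
    Entity("amount", "1.10"),
]

# Which entity types each relation connects is an assumption of this sketch.
triples = [
    ("ACME MARKET", "has_address", "12 Example Street"),
    ("baguette", "amounts_to", "1.10"),
]

# The three alternative textual inputs that a language model could be given.
inputs = {
    "raw": raw_text,
    "entities": as_entities(entities),
    "triples": as_triples(triples),
}

for name, text in inputs.items():
    print(f"{name}: {text}")
```

Each of the three strings would then be encoded by the pre-trained language model in place of the others, so that their effect on the model's performance can be compared.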
Notes
1. The platform is available at https://receipts.univ-lr.fr/.
Acknowledgements
This work was supported by the French defence innovation agency (AID) and by the VERINDOC project, funded by the Nouvelle-Aquitaine Region.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tornés, B.M., Boros, E., Doucet, A., Gomez-Krämer, P., Ogier, JM. (2023). Detecting Forged Receipts with Domain-Specific Ontology-Based Entities & Relations. In: Fink, G.A., Jain, R., Kise, K., Zanibbi, R. (eds) Document Analysis and Recognition - ICDAR 2023. ICDAR 2023. Lecture Notes in Computer Science, vol 14189. Springer, Cham. https://doi.org/10.1007/978-3-031-41682-8_12
DOI: https://doi.org/10.1007/978-3-031-41682-8_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41681-1
Online ISBN: 978-3-031-41682-8
eBook Packages: Computer Science, Computer Science (R0)