Abstract
Several data management architectures have recently emerged to cope with the challenges imposed by big data. Among them, data lakehouses are attracting considerable interest from industry and academia because they can hold disparate multi-structured batch and streaming data sources in a single data repository. The heterogeneity and complexity of such data call for a dedicated process to improve data quality and extract value from the data. Data curation is such a process: it encompasses several tasks that clean and enrich data so that it continues to fit user requirements. Nevertheless, most existing data curation approaches lack the dynamism, flexibility, and customization needed to build a data curation pipeline that aligns with end-user requirements, which may vary with the user's decision context. Moreover, they are dedicated to curating batch data sources of a single structure type (e.g., semi-structured). Considering the changing requirements of users and the need to build a data curation pipeline customized to both the user and the data source characteristics, we propose a service-based framework for adaptive data curation in data lakehouses that comprises five modules: data collection, data quality evaluation, data characterization, curation service composition, and data curation. The proposed framework is built upon a new modular ontology for data characterization and evaluation and a curation service composition approach, both of which we detail in this paper. The experimental findings validate the contributions’ performance in terms of effectiveness and execution time.
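As an illustration only, the five modules named in the abstract can be pictured as stages of a small pipeline. The sketch below is a hypothetical toy, not the authors' implementation: every name in it (`CurationContext`, `SERVICES`, `run`, the completeness metric, and the threshold rule) is an assumption made for this example. In particular, a fixed threshold rule stands in for the curation service composition approach, which the abstract does not detail, and a key-set comparison stands in for the ontology-based characterization.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[object]]


@dataclass
class CurationContext:
    """Hypothetical state passed between the five modules."""
    records: List[Record]
    quality: Dict[str, float] = field(default_factory=dict)
    characteristics: Dict[str, str] = field(default_factory=dict)
    pipeline: List[str] = field(default_factory=list)


def collect(sources: List[List[Record]]) -> CurationContext:
    # Data collection: gather records from all sources into one context.
    return CurationContext(records=[dict(r) for src in sources for r in src])


def evaluate_quality(ctx: CurationContext) -> CurationContext:
    # Data quality evaluation: completeness = share of non-null values (assumed metric).
    total = sum(len(r) for r in ctx.records)
    filled = sum(1 for r in ctx.records for v in r.values() if v is not None)
    ctx.quality["completeness"] = filled / total if total else 1.0
    return ctx


def characterize(ctx: CurationContext) -> CurationContext:
    # Data characterization: records sharing one key set are treated as structured.
    key_sets = {frozenset(r) for r in ctx.records}
    ctx.characteristics["structure"] = (
        "structured" if len(key_sets) <= 1 else "semi-structured"
    )
    return ctx


# Hypothetical curation services, each mapping a record list to a curated record list.
SERVICES: Dict[str, Callable[[List[Record]], List[Record]]] = {
    "normalize_keys": lambda recs: [{k.lower(): v for k, v in r.items()} for r in recs],
    "drop_incomplete": lambda recs: [
        r for r in recs if all(v is not None for v in r.values())
    ],
}


def compose(ctx: CurationContext, min_completeness: float) -> CurationContext:
    # Curation service composition: a threshold rule stands in for the real approach.
    ctx.pipeline = ["normalize_keys"]
    if ctx.quality["completeness"] < min_completeness:
        ctx.pipeline.append("drop_incomplete")
    return ctx


def curate(ctx: CurationContext) -> CurationContext:
    # Data curation: apply the composed services in order.
    for name in ctx.pipeline:
        ctx.records = SERVICES[name](ctx.records)
    return ctx


def run(sources: List[List[Record]], min_completeness: float = 1.0) -> CurationContext:
    ctx = characterize(evaluate_quality(collect(sources)))
    return curate(compose(ctx, min_completeness))
```

Calling `run` with a lower `min_completeness` yields a shorter pipeline for the same input, which is the kind of user-dependent adaptation the abstract motivates, here reduced to a single parameter.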











Availability of data and materials
This declaration is not applicable.
Notes
Data analysis is out of the scope of the present paper.
Acknowledgements
The authors would like to thank Dr. Fatma Guermazi, an oncology physician at Léon-Bérard Center, Lyon, France, for agreeing to validate the effectiveness of the medical enrichment proposed by our data curation approach.
Funding
No funding.
Author information
Contributions
F.Z. and C.G.G. wrote and reviewed the main manuscript text, and K.B. and N.K. participated in elaborating the scientific contribution.
Ethics declarations
Competing interests
The authors have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.
Ethical Approval
This declaration is not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This article belongs to the Topical Collection: Special Issue on Web Information Systems Engineering 2022. Guest Editors: Richard Chbeir, Helen Huang, Yannis Manolopoulos and Fabrizio Silvestri.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zouari, F., Ghedira-Guegan, C., Boukadi, K. et al. A semantic and service-based approach for adaptive mutli-structured data curation in data lakehouses. World Wide Web 26, 4001–4023 (2023). https://doi.org/10.1007/s11280-023-01218-3
DOI: https://doi.org/10.1007/s11280-023-01218-3